
This project aims to explore global causes of death and uncover trends across different regions and time periods using various data analysis and visualization tools. The goal is to provide a comprehensive understanding of mortality patterns that can inform public health strategies.

This dataset contains mortality data for a range of diseases and other causes from 1990 to 2019, covering 30 years. It highlights the rise of lifestyle-related illnesses that have accompanied modern development. The recent pandemic underscored how health crises can reshape the world, but many other illnesses continue to affect societies and influence decision-makers. This notebook analyzes the global impact of these "new age" diseases using 30 years of historical data.

1. Import Libraries

In [485]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures

# Set the plot style
sns.set(style="whitegrid")

2. Data Loading

In [486]:
df = pd.read_csv('cause_of_deaths.csv')

3. Initial Data Exploration

In [487]:
# Display the first few rows of the dataset
df.head()
Out[487]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
0 Afghanistan AFG 1990 2159 1116 371 2087 93 1370 1538 ... 2108 3709 338 2054 4154 5945 2673 5005 323 2985
1 Afghanistan AFG 1991 2218 1136 374 2153 189 1391 2001 ... 2120 3724 351 2119 4472 6050 2728 5120 332 3092
2 Afghanistan AFG 1992 2475 1162 378 2441 239 1514 2299 ... 2153 3776 386 2404 5106 6223 2830 5335 360 3325
3 Afghanistan AFG 1993 2812 1187 384 2837 108 1687 2589 ... 2195 3862 425 2797 5681 6445 2943 5568 396 3601
4 Afghanistan AFG 1994 3027 1211 391 3081 211 1809 2849 ... 2231 3932 451 3038 6001 6664 3027 5739 420 3816

5 rows × 34 columns

In [488]:
df.columns
Out[488]:
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis'],
      dtype='object')
In [489]:
df.tail()
Out[489]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
6115 Zimbabwe ZWE 2015 1439 754 215 3019 2518 770 1302 ... 3176 2108 381 2990 2373 2751 1956 4202 632 146
6116 Zimbabwe ZWE 2016 1457 767 219 3056 2050 801 1342 ... 3259 2160 393 3027 2436 2788 1962 4264 648 146
6117 Zimbabwe ZWE 2017 1460 781 223 2990 2116 818 1363 ... 3313 2196 398 2962 2473 2818 2007 4342 654 144
6118 Zimbabwe ZWE 2018 1450 795 227 2918 2088 825 1396 ... 3381 2240 400 2890 2509 2849 2030 4377 657 139
6119 Zimbabwe ZWE 2019 1450 812 232 2884 2068 827 1434 ... 3460 2292 405 2855 2554 2891 2065 4437 662 136

5 rows × 34 columns

Let's explore the country column.

In [490]:
df["Country/Territory"].describe()
Out[490]:
count            6120
unique            204
top       Afghanistan
freq               30
Name: Country/Territory, dtype: object
In [491]:
df["Country/Territory"].value_counts()
Out[491]:
Country/Territory
Afghanistan         30
Papua New Guinea    30
Niue                30
North Korea         30
North Macedonia     30
                    ..
Greenland           30
Grenada             30
Guam                30
Guatemala           30
Zimbabwe            30
Name: count, Length: 204, dtype: int64

4. Descriptive Statistics

In [492]:
# Display descriptive statistics of the dataset
df.describe()
Out[492]:
Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence Maternal Disorders HIV/AIDS ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
count 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 ... 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00
mean 2004.50 1719.70 4864.19 1173.17 2253.60 4140.96 1683.33 2083.80 1262.59 5941.90 ... 5138.70 4724.13 425.01 1965.99 5930.80 17092.37 6124.07 10725.27 588.71 618.43
std 8.66 6672.01 18220.66 4616.16 10483.63 18427.75 8877.02 6917.01 6057.97 21011.96 ... 16773.08 16470.43 2022.64 8256.00 24097.78 105157.18 20688.12 37228.05 2128.60 4186.02
min 1990.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
25% 1997.00 15.00 90.00 27.00 9.00 0.00 34.00 40.00 5.00 11.00 ... 236.00 145.75 6.00 5.00 174.75 289.00 154.00 284.00 17.00 2.00
50% 2004.50 109.00 666.50 164.00 119.00 0.00 177.00 265.00 54.00 136.00 ... 1087.00 822.00 52.50 92.00 966.50 1689.00 1210.00 2185.00 126.00 15.00
75% 2012.00 847.25 2456.25 609.25 1167.25 393.00 698.00 877.00 734.00 1879.00 ... 2954.00 2922.50 254.00 1042.50 3435.25 5249.75 3547.25 6080.00 450.00 160.00
max 2019.00 98358.00 320715.00 76990.00 268223.00 280604.00 153773.00 69640.00 107929.00 305491.00 ... 273089.00 222922.00 30883.00 202241.00 329237.00 1366039.00 270037.00 464914.00 25876.00 64305.00

8 rows × 32 columns

In [493]:
df.describe().T
Out[493]:
count mean std min 25% 50% 75% max
Year 6120.00 2004.50 8.66 1990.00 1997.00 2004.50 2012.00 2019.00
Meningitis 6120.00 1719.70 6672.01 0.00 15.00 109.00 847.25 98358.00
Alzheimer's Disease and Other Dementias 6120.00 4864.19 18220.66 0.00 90.00 666.50 2456.25 320715.00
Parkinson's Disease 6120.00 1173.17 4616.16 0.00 27.00 164.00 609.25 76990.00
Nutritional Deficiencies 6120.00 2253.60 10483.63 0.00 9.00 119.00 1167.25 268223.00
Malaria 6120.00 4140.96 18427.75 0.00 0.00 0.00 393.00 280604.00
Drowning 6120.00 1683.33 8877.02 0.00 34.00 177.00 698.00 153773.00
Interpersonal Violence 6120.00 2083.80 6917.01 0.00 40.00 265.00 877.00 69640.00
Maternal Disorders 6120.00 1262.59 6057.97 0.00 5.00 54.00 734.00 107929.00
HIV/AIDS 6120.00 5941.90 21011.96 0.00 11.00 136.00 1879.00 305491.00
Drug Use Disorders 6120.00 434.01 2898.76 0.00 3.00 20.00 129.00 65717.00
Tuberculosis 6120.00 7491.93 39549.98 0.00 35.00 417.00 2924.25 657515.00
Cardiovascular Diseases 6120.00 73160.45 291577.54 4.00 2028.00 11742.00 42546.50 4584273.00
Lower Respiratory Infections 6120.00 13687.91 48031.72 0.00 345.00 2126.50 10161.25 690913.00
Neonatal Disorders 6120.00 12558.94 56058.37 0.00 131.00 916.00 7419.75 852761.00
Alcohol Use Disorders 6120.00 787.42 3545.82 0.00 9.00 80.00 316.00 55200.00
Self-harm 6120.00 3874.83 18425.62 0.00 94.00 533.00 1882.25 220357.00
Exposure to Forces of Nature 6120.00 243.49 4717.10 0.00 0.00 0.00 12.00 222641.00
Diarrheal Diseases 6120.00 10822.80 65416.17 0.00 20.00 296.50 3946.75 1119477.00
Environmental Heat and Cold Exposure 6120.00 292.30 1704.47 0.00 2.00 21.00 109.00 29048.00
Neoplasms 6120.00 37542.24 161558.37 1.00 809.75 5629.50 20147.75 2716551.00
Conflict and Terrorism 6120.00 538.24 7033.31 0.00 0.00 0.00 23.00 503532.00
Diabetes Mellitus 6120.00 5138.70 16773.08 1.00 236.00 1087.00 2954.00 273089.00
Chronic Kidney Disease 6120.00 4724.13 16470.43 0.00 145.75 822.00 2922.50 222922.00
Poisonings 6120.00 425.01 2022.64 0.00 6.00 52.50 254.00 30883.00
Protein-Energy Malnutrition 6120.00 1965.99 8256.00 0.00 5.00 92.00 1042.50 202241.00
Road Injuries 6120.00 5930.80 24097.78 0.00 174.75 966.50 3435.25 329237.00
Chronic Respiratory Diseases 6120.00 17092.37 105157.18 1.00 289.00 1689.00 5249.75 1366039.00
Cirrhosis and Other Chronic Liver Diseases 6120.00 6124.07 20688.12 0.00 154.00 1210.00 3547.25 270037.00
Digestive Diseases 6120.00 10725.27 37228.05 0.00 284.00 2185.00 6080.00 464914.00
Fire, Heat, and Hot Substances 6120.00 588.71 2128.60 0.00 17.00 126.00 450.00 25876.00
Acute Hepatitis 6120.00 618.43 4186.02 0.00 2.00 15.00 160.00 64305.00
In [494]:
# Get the Statistical summary of the category columns

df.describe(include='object').T
Out[494]:
count unique top freq
Country/Territory 6120 204 Afghanistan 30
Code 6120 204 AFG 30

5. Data Types and Missing Values

In [495]:
# Display information about data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 34 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   Country/Territory                           6120 non-null   object
 1   Code                                        6120 non-null   object
 2   Year                                        6120 non-null   int64 
 3   Meningitis                                  6120 non-null   int64 
 4   Alzheimer's Disease and Other Dementias     6120 non-null   int64 
 5   Parkinson's Disease                         6120 non-null   int64 
 6   Nutritional Deficiencies                    6120 non-null   int64 
 7   Malaria                                     6120 non-null   int64 
 8   Drowning                                    6120 non-null   int64 
 9   Interpersonal Violence                      6120 non-null   int64 
 10  Maternal Disorders                          6120 non-null   int64 
 11  HIV/AIDS                                    6120 non-null   int64 
 12  Drug Use Disorders                          6120 non-null   int64 
 13  Tuberculosis                                6120 non-null   int64 
 14  Cardiovascular Diseases                     6120 non-null   int64 
 15  Lower Respiratory Infections                6120 non-null   int64 
 16  Neonatal Disorders                          6120 non-null   int64 
 17  Alcohol Use Disorders                       6120 non-null   int64 
 18  Self-harm                                   6120 non-null   int64 
 19  Exposure to Forces of Nature                6120 non-null   int64 
 20  Diarrheal Diseases                          6120 non-null   int64 
 21  Environmental Heat and Cold Exposure        6120 non-null   int64 
 22  Neoplasms                                   6120 non-null   int64 
 23  Conflict and Terrorism                      6120 non-null   int64 
 24  Diabetes Mellitus                           6120 non-null   int64 
 25  Chronic Kidney Disease                      6120 non-null   int64 
 26  Poisonings                                  6120 non-null   int64 
 27  Protein-Energy Malnutrition                 6120 non-null   int64 
 28  Road Injuries                               6120 non-null   int64 
 29  Chronic Respiratory Diseases                6120 non-null   int64 
 30  Cirrhosis and Other Chronic Liver Diseases  6120 non-null   int64 
 31  Digestive Diseases                          6120 non-null   int64 
 32  Fire, Heat, and Hot Substances              6120 non-null   int64 
 33  Acute Hepatitis                             6120 non-null   int64 
dtypes: int64(32), object(2)
memory usage: 1.6+ MB
In [496]:
# Check for missing values
df.isnull().sum()
Out[496]:
Country/Territory                             0
Code                                          0
Year                                          0
Meningitis                                    0
Alzheimer's Disease and Other Dementias       0
Parkinson's Disease                           0
Nutritional Deficiencies                      0
Malaria                                       0
Drowning                                      0
Interpersonal Violence                        0
Maternal Disorders                            0
HIV/AIDS                                      0
Drug Use Disorders                            0
Tuberculosis                                  0
Cardiovascular Diseases                       0
Lower Respiratory Infections                  0
Neonatal Disorders                            0
Alcohol Use Disorders                         0
Self-harm                                     0
Exposure to Forces of Nature                  0
Diarrheal Diseases                            0
Environmental Heat and Cold Exposure          0
Neoplasms                                     0
Conflict and Terrorism                        0
Diabetes Mellitus                             0
Chronic Kidney Disease                        0
Poisonings                                    0
Protein-Energy Malnutrition                   0
Road Injuries                                 0
Chronic Respiratory Diseases                  0
Cirrhosis and Other Chronic Liver Diseases    0
Digestive Diseases                            0
Fire, Heat, and Hot Substances                0
Acute Hepatitis                               0
dtype: int64

6. Duplicates Check

In [497]:
# Check for duplicate rows
df.duplicated().sum()
Out[497]:
0
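
A full-row duplicate check can miss records that repeat the same country and year with different values. A minimal sketch of a key-level check follows; the `count_duplicate_keys` helper and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_duplicate_keys(frame, keys=("Country/Territory", "Year")):
    """Return the number of rows whose key columns repeat an earlier row."""
    return int(frame.duplicated(subset=list(keys)).sum())

# Toy data (made up for illustration): Albania/1990 appears twice with different values.
toy = pd.DataFrame({
    "Country/Territory": ["Albania", "Albania", "Albania"],
    "Year": [1990, 1990, 1991],
    "Malaria": [0, 1, 0],
})

full_row_dupes = int(toy.duplicated().sum())  # full-row check: rows differ, so nothing is flagged
key_dupes = count_duplicate_keys(toy)         # key-level check: the repeated key is flagged
print(full_row_dupes, key_dupes)
```

On the real data this would be called as `count_duplicate_keys(df)`.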

7. Data Types Overview

In [498]:
# Display the data types of each column
df.dtypes
Out[498]:
Country/Territory                             object
Code                                          object
Year                                           int64
Meningitis                                     int64
Alzheimer's Disease and Other Dementias        int64
Parkinson's Disease                            int64
Nutritional Deficiencies                       int64
Malaria                                        int64
Drowning                                       int64
Interpersonal Violence                         int64
Maternal Disorders                             int64
HIV/AIDS                                       int64
Drug Use Disorders                             int64
Tuberculosis                                   int64
Cardiovascular Diseases                        int64
Lower Respiratory Infections                   int64
Neonatal Disorders                             int64
Alcohol Use Disorders                          int64
Self-harm                                      int64
Exposure to Forces of Nature                   int64
Diarrheal Diseases                             int64
Environmental Heat and Cold Exposure           int64
Neoplasms                                      int64
Conflict and Terrorism                         int64
Diabetes Mellitus                              int64
Chronic Kidney Disease                         int64
Poisonings                                     int64
Protein-Energy Malnutrition                    int64
Road Injuries                                  int64
Chronic Respiratory Diseases                   int64
Cirrhosis and Other Chronic Liver Diseases     int64
Digestive Diseases                             int64
Fire, Heat, and Hot Substances                 int64
Acute Hepatitis                                int64
dtype: object
In [499]:
# Total number of records

len(df)
Out[499]:
6120
In [500]:
df.shape
Out[500]:
(6120, 34)
In [501]:
df['Year'].nunique()
# no. of years
Out[501]:
30
In [502]:
# Unique years in the dataset
df['Year'].unique()
Out[502]:
array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], dtype=int64)
In [503]:
# Check for number of unique records present in the data

df.nunique(axis = 0)
Out[503]:
Country/Territory                              204
Code                                           204
Year                                            30
Meningitis                                    2020
Alzheimer's Disease and Other Dementias       3037
Parkinson's Disease                           1817
Nutritional Deficiencies                      2147
Malaria                                       1723
Drowning                                      1875
Interpersonal Violence                        2142
Maternal Disorders                            1818
HIV/AIDS                                      2412
Drug Use Disorders                             876
Tuberculosis                                  2843
Cardiovascular Diseases                       5225
Lower Respiratory Infections                  4106
Neonatal Disorders                            3553
Alcohol Use Disorders                         1287
Self-harm                                     2758
Exposure to Forces of Nature                   478
Diarrheal Diseases                            2874
Environmental Heat and Cold Exposure           714
Neoplasms                                     4814
Conflict and Terrorism                         918
Diabetes Mellitus                             3366
Chronic Kidney Disease                        3246
Poisonings                                    1087
Protein-Energy Malnutrition                   2091
Road Injuries                                 3393
Chronic Respiratory Diseases                  3803
Cirrhosis and Other Chronic Liver Diseases    3443
Digestive Diseases                            4023
Fire, Heat, and Hot Substances                1406
Acute Hepatitis                               1059
dtype: int64
In [505]:
# Correlation of various causes of death against year

# Select only numeric columns
numeric_df = df.select_dtypes(include="number")

# Compute the correlation matrix
correlation_matrix = numeric_df.corr()

# Display correlation of all numeric columns with 'Year'
correlation_with_year = correlation_matrix['Year']
print(correlation_with_year)
Year                                          1.00
Meningitis                                   -0.04
Alzheimer's Disease and Other Dementias       0.08
Parkinson's Disease                           0.07
Nutritional Deficiencies                     -0.08
Malaria                                      -0.02
Drowning                                     -0.04
Interpersonal Violence                       -0.00
Maternal Disorders                           -0.03
HIV/AIDS                                      0.02
Drug Use Disorders                            0.02
Tuberculosis                                 -0.03
Cardiovascular Diseases                       0.03
Lower Respiratory Infections                 -0.03
Neonatal Disorders                           -0.03
Alcohol Use Disorders                         0.01
Self-harm                                    -0.00
Exposure to Forces of Nature                 -0.01
Diarrheal Diseases                           -0.03
Environmental Heat and Cold Exposure         -0.02
Neoplasms                                     0.04
Conflict and Terrorism                       -0.01
Diabetes Mellitus                             0.07
Chronic Kidney Disease                        0.07
Poisonings                                   -0.01
Protein-Energy Malnutrition                  -0.09
Road Injuries                                 0.01
Chronic Respiratory Diseases                  0.01
Cirrhosis and Other Chronic Liver Diseases    0.03
Digestive Diseases                            0.03
Fire, Heat, and Hot Substances               -0.01
Acute Hepatitis                              -0.03
Name: Year, dtype: float64

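+1: perfect positive correlation (deaths from the cause rise as Year increases).

-1: perfect negative correlation (deaths from the cause fall as Year increases).

0: no linear correlation between Year and the cause of death.

All values here are close to 0 because the rows pool countries of very different sizes, so differences between countries dominate any trend over time.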
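
One way to isolate the time trend is to sum deaths per year first and then correlate the yearly totals with Year. The sketch below assumes the same column layout as `df`; the `yearly_total_correlation` helper and the two-country toy frame are made up for illustration.

```python
import pandas as pd

def yearly_total_correlation(frame, cause, year_col="Year"):
    """Pearson correlation between Year and the yearly total of one cause."""
    totals = frame.groupby(year_col)[cause].sum()
    return totals.index.to_series().corr(totals)

# Toy data (made up): one small and one large country, both rising steadily.
toy = pd.DataFrame({
    "Country/Territory": ["Small"] * 4 + ["Large"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Cardiovascular Diseases": [10, 11, 12, 13, 1000, 1100, 1200, 1300],
})

pooled = toy["Year"].corr(toy["Cardiovascular Diseases"])            # rows pooled across countries
aggregated = yearly_total_correlation(toy, "Cardiovascular Diseases")  # yearly totals only
print(round(pooled, 2), round(aggregated, 2))
```

In the toy data both countries rise every year, yet the pooled coefficient stays small while the aggregated one is close to 1.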

In [506]:
# Total number of countries/territories

df['Country/Territory'].nunique()
Out[506]:
204
In [507]:
# Number of yearly records for each country

df['Country/Territory'].value_counts()
Out[507]:
Country/Territory
Afghanistan         30
Papua New Guinea    30
Niue                30
North Korea         30
North Macedonia     30
                    ..
Greenland           30
Grenada             30
Guam                30
Guatemala           30
Zimbabwe            30
Name: count, Length: 204, dtype: int64

Each country has 30 years of data.
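
The counts above can also be verified programmatically. The following is a sketch of a balanced-panel check, assuming the column names used in this notebook; `is_balanced_panel` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def is_balanced_panel(frame, entity_col="Country/Territory", year_col="Year"):
    """True if every entity has exactly one row for every year present in the data."""
    n_years = frame[year_col].nunique()
    per_entity = frame.groupby(entity_col)[year_col].agg(["nunique", "count"])
    return bool(((per_entity["nunique"] == n_years) & (per_entity["count"] == n_years)).all())

# Toy data (made up): two entities, two years each.
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
})

balanced = is_balanced_panel(toy)                # every entity covers every year
unbalanced = is_balanced_panel(toy.iloc[:-1])    # entity B is missing 2001
print(balanced, unbalanced)
```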

In [508]:
df["Year"].value_counts()
Out[508]:
Year
1990    204
1991    204
2018    204
2017    204
2016    204
2015    204
2014    204
2013    204
2012    204
2011    204
2010    204
2009    204
2008    204
2007    204
2006    204
2005    204
2004    204
2003    204
2002    204
2001    204
2000    204
1999    204
1998    204
1997    204
1996    204
1995    204
1994    204
1993    204
1992    204
2019    204
Name: count, dtype: int64

Which country has the highest number of deaths due to cardiovascular diseases?

In [509]:
car_disease = df.groupby("Country/Territory")["Cardiovascular Diseases"].sum().sort_values(ascending=False).head(20)
car_disease
Out[509]:
Country/Territory
China             100505973
India              52994710
Russia             33903781
United States      26438346
Indonesia          13587011
Ukraine            13053052
Germany            10819770
Brazil              9589019
Japan               9210437
Pakistan            7745192
Italy               6614384
United Kingdom      6603062
Bangladesh          6123691
Egypt               5995471
Vietnam             5323920
Poland              5233134
France              4729313
Romania             4474916
Nigeria             4176488
Turkey              4167835
Name: Cardiovascular Diseases, dtype: int64

Global Causes of Death: distribution of the top causes of death

In [510]:
# Summing the causes of death across all countries and years
global_causes = df.drop(columns=['Country/Territory', 'Code', 'Year']).sum().sort_values(ascending=False)

# Plot the top 10 causes of death globally
plt.figure(figsize=(12,6))
sns.barplot(x=global_causes.index[:10], y=global_causes.values[:10], palette='Blues_d')
plt.xticks(rotation=90)
plt.title('Top 10 Global Causes of Death')
plt.ylabel('Total Deaths')
plt.xlabel('Cause of Death')
plt.show()

Trend Analysis Over Time: how a specific cause of death (e.g., Cardiovascular Diseases) has changed over the years

In [511]:
# Group by year and sum up deaths for Cardiovascular Diseases
cardio_trend = df.groupby('Year')['Cardiovascular Diseases'].sum()

# Plotting the trend
plt.figure(figsize=(10,5))
sns.lineplot(x=cardio_trend.index, y=cardio_trend.values)
plt.title('Cardiovascular Diseases Trend Over Time')
plt.ylabel('Total Deaths')
plt.xlabel('Year')
plt.show()

Regional Comparison: a bar chart comparing deaths from a given cause across countries

In [512]:
# Group by country and sum deaths for all causes (numeric columns only, so 'Code' strings are not concatenated)
regional_comparison = df.groupby('Country/Territory').sum(numeric_only=True).drop(columns=['Year'])

# Sort by Cardiovascular Diseases
regional_comparison = regional_comparison.sort_values(by='Cardiovascular Diseases', ascending=False)

# Plotting the top 10 countries for Cardiovascular Diseases
plt.figure(figsize=(12,6))
sns.barplot(x=regional_comparison.index[:10], y=regional_comparison['Cardiovascular Diseases'][:10], palette='Reds_d')
plt.xticks(rotation=90)
plt.title('Top 10 Countries with Cardiovascular Diseases Deaths')
plt.ylabel('Total Deaths')
plt.xlabel('Country/Territory')
plt.show()
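
The dataset has a country column but no region column, so the chart above compares countries. A sketch of rolling countries up into regions follows; the `REGION_OF` mapping is hypothetical and deliberately partial, `deaths_by_region` is an illustrative helper, and the toy numbers are made up.

```python
import pandas as pd

# Hypothetical, partial mapping for illustration; a real analysis needs a complete lookup table.
REGION_OF = {"China": "Asia", "India": "Asia", "Germany": "Europe", "Nigeria": "Africa"}

def deaths_by_region(frame, cause, mapping, country_col="Country/Territory"):
    """Sum one cause of death per region; countries missing from the mapping go to 'Unmapped'."""
    regions = frame[country_col].map(mapping).fillna("Unmapped").rename("Region")
    return frame.groupby(regions)[cause].sum().sort_values(ascending=False)

# Toy data (made up): "Atlantis" is intentionally absent from the mapping.
toy = pd.DataFrame({
    "Country/Territory": ["China", "India", "Germany", "Atlantis"],
    "Cardiovascular Diseases": [100, 50, 30, 5],
})

by_region = deaths_by_region(toy, "Cardiovascular Diseases", REGION_OF)
print(by_region)
```

Keeping an explicit "Unmapped" bucket makes gaps in the lookup table visible instead of silently dropping those rows.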
In [513]:
plt.figure(figsize=(12, 6))
sns.barplot(x=car_disease, y=car_disease.index, palette='viridis', orient='h')
plt.title('Sum of Cardiovascular Diseases by Country')
plt.xlabel('Sum of Cardiovascular Diseases')
plt.ylabel('Country/Territory')
plt.xticks(ha='right')

plt.show()
In [514]:
for country in car_disease.index[:5] :
    selected_country_data = df[df['Country/Territory'] == country]
    sns.lineplot(x='Year', y='Cardiovascular Diseases', data=selected_country_data, label=country, marker='o', markersize=8)
plt.title('Disease Counts Over Years - Top 5 Countries')
plt.xlabel('Year')
plt.ylabel('Cardiovascular Diseases Count')
plt.legend(title='Country/Territory', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
In [515]:
disease_data = df.groupby("Year")["Cardiovascular Diseases"].sum().reset_index()
disease_data
Out[515]:
Year Cardiovascular Diseases
0 1990 12062179
1 1991 12220282
2 1992 12437979
3 1993 12802108
4 1994 13026289
5 1995 13129252
6 1996 13213565
7 1997 13339902
8 1998 13461489
9 1999 13720763
10 2000 13957078
11 2001 14185571
12 2002 14501696
13 2003 14710723
14 2004 14745985
15 2005 14995528
16 2006 14991661
17 2007 15117363
18 2008 15402070
19 2009 15552545
20 2010 15838151
21 2011 16038263
22 2012 16245243
23 2013 16490053
24 2014 16715810
25 2015 17089707
26 2016 17398709
27 2017 17685890
28 2018 18113910
29 2019 18552218
In [516]:
sns.lineplot(x='Year', y='Cardiovascular Diseases', data=disease_data, marker='o', markersize=8);
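
The size of the rise can be expressed as a percentage change between the first and last year. The sketch below uses the 1990 and 2019 totals from the table above; `percent_change` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

def percent_change(series):
    """Percent change from the first to the last value of a series ordered by year."""
    first, last = series.iloc[0], series.iloc[-1]
    return (last - first) / first * 100

# Endpoints copied from the yearly cardiovascular totals shown above.
cardio = pd.Series({1990: 12062179, 2019: 18552218})
change = percent_change(cardio)
print(f"{change:.1f}%")
```

This gives an increase of about 54% in raw death counts, which are not adjusted for population growth over the same period.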

What is the trend in the number of deaths caused by Alzheimer’s disease and other dementias over the years?

In [517]:
Dementias_trend = df.groupby("Year")["Alzheimer's Disease and Other Dementias"].sum().reset_index()
Dementias_trend
Out[517]:
Year Alzheimer's Disease and Other Dementias
0 1990 560616
1 1991 583166
2 1992 605894
3 1993 629571
4 1994 652176
5 1995 674815
6 1996 696665
7 1997 717342
8 1998 738768
9 1999 761620
10 2000 786615
11 2001 814526
12 2002 845695
13 2003 877011
14 2004 909148
15 2005 945619
16 2006 982308
17 2007 1022057
18 2008 1065297
19 2009 1109405
20 2010 1155944
21 2011 1201138
22 2012 1247515
23 2013 1294701
24 2014 1343756
25 2015 1394942
26 2016 1451840
27 2017 1509646
28 2018 1568617
29 2019 1622426
In [518]:
sns.lineplot(x='Year', y="Alzheimer's Disease and Other Dementias", data=Dementias_trend, marker='o', markersize=8);
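
A compound annual growth rate summarizes this kind of steady rise in a single number. The sketch below uses the 1990 and 2019 totals from the table above; `compound_annual_growth` is an illustrative helper, and a constant yearly rate is an assumption of the measure, not a property of the data.

```python
def compound_annual_growth(first, last, n_intervals):
    """Constant yearly growth rate that takes `first` to `last` over `n_intervals` years."""
    return (last / first) ** (1 / n_intervals) - 1

# Endpoints copied from the yearly dementia totals shown above (1990 and 2019).
rate = compound_annual_growth(560616, 1622426, n_intervals=2019 - 1990)
print(f"{rate:.2%} per year")
```

The result is a growth rate of a little under 4% per year in raw death counts.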

Which country has the highest number of deaths caused by malaria?

In [519]:
malaria_disease = df.groupby("Country/Territory")["Malaria"].sum().sort_values(ascending=False).head(20)
malaria_disease
Out[519]:
Country/Territory
Nigeria                         6422063
Democratic Republic of Congo    2557219
India                           2439244
Uganda                          1265629
Burkina Faso                     950762
Cote d'Ivoire                    941597
Mozambique                       817948
Tanzania                         800490
Ghana                            721339
Mali                             711087
Niger                            693962
Cameroon                         614095
Ethiopia                         453985
Malawi                           404288
Sierra Leone                     394491
Guinea                           362660
Bangladesh                       349375
Burundi                          320767
Angola                           317069
Benin                            316834
Name: Malaria, dtype: int64
In [520]:
plt.figure(figsize=(12, 6))
sns.barplot(x=malaria_disease, y=malaria_disease.index, palette='viridis', orient='h')
plt.title('Sum of Malaria Deaths by Country')
plt.xlabel('Sum of Malaria Deaths')
plt.ylabel('Country/Territory')
plt.xticks(ha='right')

plt.show()

All of the top 20 countries by malaria deaths are in Africa except India and Bangladesh, which is consistent with malaria's concentration in sub-Saharan Africa and South Asia.
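
Absolute totals can be complemented by each country's share of the global total for a cause. The sketch below assumes the column layout of `df`; `country_share` is a hypothetical helper and the toy numbers are made up.

```python
import pandas as pd

def country_share(frame, cause, country_col="Country/Territory"):
    """Each country's percentage share of total deaths for one cause, largest first."""
    totals = frame.groupby(country_col)[cause].sum()
    return (totals / totals.sum() * 100).sort_values(ascending=False)

# Toy data (made up): country A accounts for 80 of 100 deaths.
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Malaria": [60, 20, 20],
})

shares = country_share(toy, "Malaria")
print(shares)
```

On the real data, `country_share(df, "Malaria").head(20)` would list the same countries as above with percentages instead of counts.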

In [521]:
df.groupby("Country/Territory").sum()
Out[521]:
Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence Maternal Disorders ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
Country/Territory
Afghanistan AFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGA... 60135 78666 41998 13397 71453 13924 56536 108228 129621 ... 93207 134676 14530 70163 208331 209857 98419 186959 13559 98108
Albania ALBALBALBALBALBALBALBALBALBALBALBALBALBALBALBA... 60135 1323 16549 4491 569 0 2397 5242 246 ... 4055 7636 500 526 8522 22632 8717 14907 636 44
Algeria DZADZADZADZADZADZADZADZADZADZADZADZADZADZADZAD... 60135 15685 86914 22943 7138 70 24273 16702 29475 ... 89035 154666 12337 6407 369395 168453 91927 146527 27628 10492
American Samoa ASMASMASMASMASMASMASMASMASMASMASMASMASMASMASMA... 60135 30 143 69 60 0 120 101 30 ... 970 512 0 60 164 612 181 341 0 0
Andorra ANDANDANDANDANDANDANDANDANDANDANDANDANDANDANDA... 60135 0 614 137 0 0 0 15 0 ... 198 292 0 0 259 838 283 560 0 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Venezuela VENVENVENVENVENVENVENVENVENVENVENVENVENVENVENV... 60135 11615 108735 18573 22554 3726 20273 266071 12739 ... 175790 161667 2607 21347 175036 122198 91720 168365 4949 1109
Vietnam VNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMV... 60135 38559 369363 83322 48613 17311 214356 47981 13167 ... 544222 396874 34681 7366 594980 911787 527192 735817 17380 30650
Yemen YEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMY... 60135 21095 31045 7188 68939 143463 27994 17918 53611 ... 30812 52119 12561 66731 278327 126525 64136 111536 23871 26532
Zambia ZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZ... 60135 98886 13473 4054 95913 205529 12809 30065 28395 ... 54098 41751 9056 92915 56976 59173 100581 147640 9476 8846
Zimbabwe ZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZ... 60135 41238 20017 5764 66723 118728 18169 32741 29802 ... 71175 49952 9113 65942 67207 71774 55027 108691 14718 3778

204 rows × 33 columns

Nigeria's high death toll is commonly attributed to its weak health systems and poverty.

What percentage of the total number of deaths is caused by lower respiratory infections, and by each of the other causes?

In [522]:
deaths_causes = df.iloc[:, 3:].sum().sort_values(ascending = False)
deaths_causes_per = (deaths_causes.div(deaths_causes.sum()) * 100).round(2)
deaths_causes_per
Out[522]:
Cardiovascular Diseases                      30.50
Neoplasms                                    15.65
Chronic Respiratory Diseases                  7.13
Lower Respiratory Infections                  5.71
Neonatal Disorders                            5.24
Diarrheal Diseases                            4.51
Digestive Diseases                            4.47
Tuberculosis                                  3.12
Cirrhosis and Other Chronic Liver Diseases    2.55
HIV/AIDS                                      2.48
Road Injuries                                 2.47
Diabetes Mellitus                             2.14
Alzheimer's Disease and Other Dementias       2.03
Chronic Kidney Disease                        1.97
Malaria                                       1.73
Self-harm                                     1.62
Nutritional Deficiencies                      0.94
Interpersonal Violence                        0.87
Protein-Energy Malnutrition                   0.82
Meningitis                                    0.72
Drowning                                      0.70
Maternal Disorders                            0.53
Parkinson's Disease                           0.49
Alcohol Use Disorders                         0.33
Acute Hepatitis                               0.26
Fire, Heat, and Hot Substances                0.25
Conflict and Terrorism                        0.22
Drug Use Disorders                            0.18
Poisonings                                    0.18
Environmental Heat and Cold Exposure          0.12
Exposure to Forces of Nature                  0.10
dtype: float64

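Lower respiratory infections account for 5.71% of all recorded deaths. Cardiovascular diseases, neoplasms, and chronic respiratory diseases together account for about 53% of the total (30.50% + 15.65% + 7.13%), which rises to about 59% when lower respiratory infections are included.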
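
As a quick check on the combined share of the leading causes, the percentages printed above can be accumulated. This is a minimal sketch that hard-codes the top four values from the `deaths_causes_per` output; the names `top_shares` and `cumulative_share` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Top four shares (%) copied from the deaths_causes_per output above
top_shares = pd.Series({
    'Cardiovascular Diseases': 30.50,
    'Neoplasms': 15.65,
    'Chronic Respiratory Diseases': 7.13,
    'Lower Respiratory Infections': 5.71,
})

# Running total of the shares, largest cause first
cumulative_share = top_shares.cumsum().round(2)
print(cumulative_share)
```

On the full data, the same idea would be `deaths_causes_per.cumsum()`, since that Series is already sorted in descending order.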

In [523]:
plt.figure(figsize=(12, 10))
sns.barplot(x=deaths_causes, y=deaths_causes.index, palette='viridis', orient='h')
plt.title('Causes of Death')
plt.xlabel('Counts')
plt.xticks(ha='right')

plt.show()

Feature Engineering¶

In [524]:
df.head()
Out[524]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
0 Afghanistan AFG 1990 2159 1116 371 2087 93 1370 1538 ... 2108 3709 338 2054 4154 5945 2673 5005 323 2985
1 Afghanistan AFG 1991 2218 1136 374 2153 189 1391 2001 ... 2120 3724 351 2119 4472 6050 2728 5120 332 3092
2 Afghanistan AFG 1992 2475 1162 378 2441 239 1514 2299 ... 2153 3776 386 2404 5106 6223 2830 5335 360 3325
3 Afghanistan AFG 1993 2812 1187 384 2837 108 1687 2589 ... 2195 3862 425 2797 5681 6445 2943 5568 396 3601
4 Afghanistan AFG 1994 3027 1211 391 3081 211 1809 2849 ... 2231 3932 451 3038 6001 6664 3027 5739 420 3816

5 rows × 34 columns

In [525]:
cause_of_deaths = ['Meningitis',
       'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
In [526]:
# Creating a new column for 'Total_no_of_Deaths' for individual Country and Year

df['Total_no_of_Deaths'] = df[cause_of_deaths].sum(axis=1)
In [527]:
# Top 10 Total_no_of_Deaths

top10_Total_no_of_Deaths = df.sort_values(by='Total_no_of_Deaths',ascending=False)[:10][['Total_no_of_Deaths','Country/Territory']]

top10_Total_no_of_Deaths
Out[527]:
Total_no_of_Deaths Country/Territory
1139 10442561 China
1138 10163943 China
1137 9978653 China
1119 9814213 China
1118 9591222 China
1117 9503904 China
1116 9411928 China
1114 9366974 China
1115 9364587 China
1113 9284664 China

8. Exploratory Data Analysis (EDA)¶

In [528]:
# Display descriptive statistics again for reference
df.describe()
Out[528]:
Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence Maternal Disorders HIV/AIDS ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
count 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 ... 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00 6120.00
mean 2004.50 1719.70 4864.19 1173.17 2253.60 4140.96 1683.33 2083.80 1262.59 5941.90 ... 4724.13 425.01 1965.99 5930.80 17092.37 6124.07 10725.27 588.71 618.43 239891.29
std 8.66 6672.01 18220.66 4616.16 10483.63 18427.75 8877.02 6917.01 6057.97 21011.96 ... 16470.43 2022.64 8256.00 24097.78 105157.18 20688.12 37228.05 2128.60 4186.02 873713.89
min 1990.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 7.00
25% 1997.00 15.00 90.00 27.00 9.00 0.00 34.00 40.00 5.00 11.00 ... 145.75 6.00 5.00 174.75 289.00 154.00 284.00 17.00 2.00 6935.00
50% 2004.50 109.00 666.50 164.00 119.00 0.00 177.00 265.00 54.00 136.00 ... 822.00 52.50 92.00 966.50 1689.00 1210.00 2185.00 126.00 15.00 50257.50
75% 2012.00 847.25 2456.25 609.25 1167.25 393.00 698.00 877.00 734.00 1879.00 ... 2922.50 254.00 1042.50 3435.25 5249.75 3547.25 6080.00 450.00 160.00 158221.00
max 2019.00 98358.00 320715.00 76990.00 268223.00 280604.00 153773.00 69640.00 107929.00 305491.00 ... 222922.00 30883.00 202241.00 329237.00 1366039.00 270037.00 464914.00 25876.00 64305.00 10442561.00

8 rows × 33 columns

Data Distribution: The large differences between mean and median values, along with the high standard deviations and large range, suggest that the data is highly skewed or has extreme outliers.

Variation: The standard deviations are very large for most causes of death, indicating substantial variability in the number of deaths recorded.

Range: The minimum and maximum values show how widely the recorded death counts vary. For example, Meningitis deaths range from 0 to 98,358 in a single country and year.
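
The skew described above can be quantified rather than read off the summary table. The sketch below uses a small hypothetical sample, loosely modeled on the Meningitis quantiles, and compares the mean with the median alongside the sample skewness from `Series.skew()`. The values are illustrative, not rows from the dataset.

```python
import pandas as pd

# Hypothetical right-skewed sample: many small counts and one very large one
sample = pd.Series([0, 5, 15, 109, 847, 98358], name='Deaths')

summary = {
    'mean': sample.mean(),
    'median': sample.median(),
    'skew': sample.skew(),  # sample skewness; > 0 indicates a long right tail
}
print(summary)
```

A mean far above the median together with a positive skewness is the pattern the `describe()` output suggests for most causes. On the real data, `df[cause_of_deaths].skew()` would report one value per cause.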

In [529]:
# Plot the distribution of a specific cause of death
plt.figure(figsize=(10, 6))
sns.histplot(df['Meningitis'], bins=30, kde=True)
plt.title('Distribution of Meningitis Deaths')
plt.xlabel('Number of Deaths')
plt.ylabel('Frequency')
plt.show()
In [530]:
# Boxplot for various causes of death
plt.figure(figsize=(14, 8))
sns.boxplot(data=df[['Meningitis', "Alzheimer's Disease and Other Dementias", "Parkinson's Disease"]])
plt.title('Boxplot of Selected Causes of Death')
plt.xlabel('Cause of Death')
plt.ylabel('Number of Deaths')
plt.xticks(rotation=45)
plt.show()
In [531]:
# Find the total number of each disease 

disease_df = df[cause_of_deaths].sum().to_frame().reset_index()

disease_df.rename(columns = {'index': 'Diseases', 0:'Total_deaths'}, inplace = True)

disease_df
Out[531]:
Diseases Total_deaths
0 Meningitis 10524572
1 Alzheimer's Disease and Other Dementias 29768839
2 Parkinson's Disease 7179795
3 Nutritional Deficiencies 13792032
4 Malaria 25342676
5 Drowning 10301999
6 Interpersonal Violence 12752839
7 Maternal Disorders 7727046
8 HIV/AIDS 36364419
9 Drug Use Disorders 2656121
10 Tuberculosis 45850603
11 Cardiovascular Diseases 447741982
12 Lower Respiratory Infections 83770038
13 Neonatal Disorders 76860729
14 Alcohol Use Disorders 4819018
15 Self-harm 23713931
16 Exposure to Forces of Nature 1490132
17 Diarrheal Diseases 66235508
18 Environmental Heat and Cold Exposure 1788851
19 Neoplasms 229758538
20 Conflict and Terrorism 3294053
21 Diabetes Mellitus 31448872
22 Chronic Kidney Disease 28911692
23 Poisonings 2601082
24 Protein-Energy Malnutrition 12031885
25 Road Injuries 36296469
26 Chronic Respiratory Diseases 104605334
27 Cirrhosis and Other Chronic Liver Diseases 37479321
28 Digestive Diseases 65638635
29 Fire, Heat, and Hot Substances 3602914
30 Acute Hepatitis 3784791
In [532]:
# Create a Treemap 
import plotly.express as px
fig = px.treemap(disease_df, 
                 path = [px.Constant('Total_deaths'), 'Diseases'], 
                 values = 'Total_deaths'
                 )

# Add some text for labels, title 
fig.update_traces(textinfo='label+percent parent')    
fig.update_layout(title_text='Percentage of cause of deaths around the world during 1990-2019', title_x=0.5, font_size=15)
fig.show()
In [533]:
df.columns
Out[533]:
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
       'Total_no_of_Deaths'],
      dtype='object')

Total Number of Deaths Around the World¶

In [534]:
# Find the total number of deaths grouped by country

country_df = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=False).reset_index()

country_df
Out[534]:
Country/Territory Total_no_of_Deaths
0 China 265408106
1 India 238158165
2 United States 71197802
3 Russia 59591155
4 Indonesia 44046941
... ... ...
199 Cook Islands 3999
200 Tuvalu 2962
201 Nauru 2249
202 Niue 591
203 Tokelau 299

204 rows × 2 columns

In [535]:
# Find the 10 countries with the highest total number of deaths

Top10_countries = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=False).head(10).reset_index()

Top10_countries
Out[535]:
Country/Territory Total_no_of_Deaths
0 China 265408106
1 India 238158165
2 United States 71197802
3 Russia 59591155
4 Indonesia 44046941
5 Nigeria 43670014
6 Pakistan 38151878
7 Brazil 32674112
8 Japan 31922807
9 Germany 25559667
In [536]:
# Create a bar chart of Top 10 countries with the highest number of deaths

plt.figure(figsize = (12,8))

sns.barplot(data = Top10_countries, x = 'Country/Territory', y = 'Total_no_of_Deaths', color = 'Blue')

# Add some text for labels, title 
plt.xlabel('Country', fontsize = 12)
plt.ylabel('Total Number of Deaths', fontsize = 12)
plt.title('Top 10 countries with the highest number of deaths', fontsize =15)
Out[536]:
Text(0.5, 1.0, 'Top 10 countries with the highest number of deaths')
In [537]:
# Find the Top 10 Countries with the LOWEST number of deaths

Low10_countries = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=True).head(10).reset_index()

Low10_countries
Out[537]:
Country/Territory Total_no_of_Deaths
0 Tokelau 299
1 Niue 591
2 Nauru 2249
3 Tuvalu 2962
4 Cook Islands 3999
5 Palau 4814
6 San Marino 6761
7 Northern Mariana Islands 7827
8 American Samoa 8619
9 Marshall Islands 10186
In [538]:
plt.figure(figsize=(12,8))

sns.barplot(data = Low10_countries, x = 'Country/Territory', y = 'Total_no_of_Deaths', color = 'Blue')

# Add some text for labels, title 
plt.xticks(rotation = 90)
plt.xlabel('Country', fontsize = 12)
plt.ylabel('Total Number of Deaths', fontsize = 12)
plt.title('Top 10 Countries with the lowest number of deaths', fontsize =15)
Out[538]:
Text(0.5, 1.0, 'Top 10 Countries with the lowest number of deaths')
In [539]:
# A Treemap for the Percentage of Total Number of Deaths group by country

fig = px.treemap(country_df, 
                 path = [px.Constant('Total_no_of_Deaths'), 'Country/Territory'], 
                 values = 'Total_no_of_Deaths'
                 )

# Add some text for labels, title 
fig.update_traces(textinfo='label+percent parent')    
fig.update_layout(title_text='Percentage of total number of deaths around the world', title_x=0.5, font_size=15)
fig.show()
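
The percentages shown in the treemap can also be computed numerically. This is a minimal sketch that hard-codes the five largest country totals and the global total (1,468,134,716) from the notebook's printed outputs; `country_totals`, `global_total`, and `country_share` are illustrative names.

```python
import pandas as pd

# Totals copied from the notebook outputs (five largest countries)
country_totals = pd.Series({
    'China': 265408106,
    'India': 238158165,
    'United States': 71197802,
    'Russia': 59591155,
    'Indonesia': 44046941,
})

# Global total over all countries, causes, and years, as printed in the notebook
global_total = 1468134716

# Share of the global total, in percent
country_share = (country_totals / global_total * 100).round(2)
print(country_share)
```

With the full table, the equivalent would be to divide `country_df['Total_no_of_Deaths']` by its own sum.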

CHINA¶

In [540]:
# Group by country and sum the deaths for each cause
country_cause_deaths = df.groupby('Country/Territory')[cause_of_deaths].sum()

# Visualize the top causes of death for a specific country (e.g., 'China')
country_cause_deaths.loc['China'].plot(kind='bar', figsize=(12, 6), color='navy')
plt.title('Causes of Death in China (1990-2019)')
plt.ylabel('Total Number of Deaths')
plt.xticks(rotation=90)
plt.show()
In [541]:
# China rows only, sorted by "Total_no_of_Deaths" (highest first)

China_Total_no_of_Deaths_df = df[df['Country/Territory']=='China'].sort_values(by='Total_no_of_Deaths',ascending=False)
In [542]:
# China - "Total_no_of_Deaths" against "Year"

plt.figure(figsize=(8,4),dpi=200)
sns.scatterplot(data=China_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for China")
plt.show();
In [543]:
plt.figure(figsize=(15,8),dpi=200)
sns.barplot(data=China_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for China")
plt.show();

Note: the total number of deaths recorded in China rises clearly over the period.
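
A rise like this can be checked numerically as well as visually. The sketch below fits a least-squares line with `np.polyfit` and inspects the year-over-year differences. The yearly totals are hypothetical values chosen for illustration, not figures taken from the dataset.

```python
import numpy as np

# Hypothetical yearly totals (not taken from the dataset)
years = np.array([2015, 2016, 2017, 2018, 2019])
totals = np.array([9.1e6, 9.3e6, 9.9e6, 10.1e6, 10.4e6])

# Least-squares line: the slope is the average change in deaths per year
slope, intercept = np.polyfit(years, totals, deg=1)

# Year-over-year differences show whether the rise holds in every single year
yearly_change = np.diff(totals)
rises_every_year = bool((yearly_change > 0).all())
print(slope, rises_every_year)
```

A positive slope supports an overall upward trend, while `rises_every_year` distinguishes a strictly increasing series from one that rises only on average.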

Common Causes of Death

In [544]:
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Malaria'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Malaria Deaths")
plt.title("Year Vs. Malaria Deaths in China")
plt.show();

Malaria deaths recorded in China drop rapidly after 1999.

In [545]:
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Nutritional Deficiencies'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Nutritional Deficiencies Deaths")
plt.title("Year Vs. Nutritional Deficiencies Deaths in China")
plt.show();

Deaths from nutritional deficiencies recorded in China dropped in 2007, and from 2008 the count started to rise again.

In [546]:
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Interpersonal Violence'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Interpersonal Violence Deaths")
plt.title("Year Vs. Interpersonal Violence Deaths in China")
plt.show();

Interpersonal violence deaths recorded in China decline continually over the period.
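
The three per-cause cells above repeat the same filter-and-select pattern, which could be wrapped in a small helper. This is a sketch only: `yearly_deaths` is a hypothetical function, not part of the notebook, and the toy frame merely mimics the column layout of `df`.

```python
import pandas as pd

def yearly_deaths(data, country, cause):
    """Return deaths from `cause` in `country`, indexed by year (ascending)."""
    rows = data[data['Country/Territory'] == country]
    return rows.set_index('Year')[cause].sort_index()

# Hypothetical toy frame with the same column layout as df
toy = pd.DataFrame({
    'Country/Territory': ['China', 'China', 'China', 'India'],
    'Year': [1992, 1990, 1991, 1990],
    'Malaria': [30, 50, 40, 900],
})

china_malaria = yearly_deaths(toy, 'China', 'Malaria')
print(china_malaria)
```

With the real data, a call such as `yearly_deaths(df, 'China', 'Interpersonal Violence').plot(kind='bar')` would be expected to reproduce the bar charts above without regrouping the frame each time.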

INDIA¶

In [547]:
# India rows only, sorted by "Total_no_of_Deaths" (highest first)

India_Total_no_of_Deaths_df = df[df['Country/Territory']=='India'].sort_values(by='Total_no_of_Deaths',ascending=False)
In [548]:
# India - "Total_no_of_Deaths" against "Year"

plt.figure(figsize=(8,4),dpi=200)
sns.scatterplot(data=India_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for India")
plt.show();
In [549]:
plt.figure(figsize=(15,8),dpi=200)
sns.barplot(data=India_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for India")
plt.show();

Overall, the total number of deaths recorded in India rises over the years, although with fluctuations in between.
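
When a series fluctuates like this, a rolling average makes the underlying direction easier to see. The sketch below applies a centered 3-year rolling mean to hypothetical yearly totals. The numbers and the window size are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical yearly totals with year-to-year fluctuations
totals = pd.Series(
    [100, 120, 110, 130, 125, 150],
    index=pd.Index(range(2000, 2006), name='Year'),
)

# Centered 3-year rolling mean; the first and last years have no full window
rolling_avg = totals.rolling(window=3, center=True).mean()
print(rolling_avg)
```

The raw series dips in 2002 and 2004, while the smoothed series increases throughout, which is the sense in which the overall trend is upward despite the fluctuations.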

Common Causes of Death

In [550]:
plt.figure(figsize=(12,8),dpi=200)
india_df = India_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
india_df['Malaria'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Malaria Deaths")
plt.title("Year Vs. Malaria Deaths in India")
plt.show();

Malaria deaths recorded in India drop rapidly from 1990 onward, but the counts for 2018 and 2019 are higher than those for 2016 and 2017.

Top 3 Countries in Terms of "Total no. of Deaths" - For All the Years¶

In [551]:
# Total causes of death across 30 years

Countries_Total_no_of_Deaths_noyear_df = df.groupby('Country/Territory').sum()
Countries_Total_no_of_Deaths_noyear_df.drop('Year',axis=1,inplace=True)
In [552]:
# Top 3 countries in terms of "Total no. of Deaths" - for all the years

Countries_Total_no_of_Deaths_noyear_df.sort_values(by='Total_no_of_Deaths',ascending =False)[:3]
Out[552]:
Code Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence Maternal Disorders HIV/AIDS ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
Country/Territory
China CHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNC... 480899 5381846 1533092 584236 13418 2873619 776275 243257 433709 ... 4195276 770140 507664 8350399 36676826 4918899 8924906 383402 318564 265408106
India INDINDINDINDINDINDINDINDINDINDINDINDINDINDINDI... 2008944 1707561 756832 3290569 2439244 2110438 1237163 2292449 2454374 ... 4556172 170119 2356222 5346154 25232974 6294910 11804380 730580 1672179 238158165
United States USAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAU... 40032 3302609 661288 133044 0 114752 596818 25206 528417 ... 2018497 40259 121030 1359744 4949052 1514325 3026943 126712 5851 71197802

3 rows × 33 columns

In [553]:
df.head()
Out[553]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
0 Afghanistan AFG 1990 2159 1116 371 2087 93 1370 1538 ... 3709 338 2054 4154 5945 2673 5005 323 2985 147971
1 Afghanistan AFG 1991 2218 1136 374 2153 189 1391 2001 ... 3724 351 2119 4472 6050 2728 5120 332 3092 156844
2 Afghanistan AFG 1992 2475 1162 378 2441 239 1514 2299 ... 3776 386 2404 5106 6223 2830 5335 360 3325 169156
3 Afghanistan AFG 1993 2812 1187 384 2837 108 1687 2589 ... 3862 425 2797 5681 6445 2943 5568 396 3601 182230
4 Afghanistan AFG 1994 3027 1211 391 3081 211 1809 2849 ... 3932 451 3038 6001 6664 3027 5739 420 3816 194795

5 rows × 35 columns


Identify Top Causes of Death Globally:

Calculate the total number of deaths for each cause across all years and countries.

In [554]:
# Calculate total deaths for each cause
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year']).sum().sort_values(ascending=False)

# Display the top causes of death
print(total_deaths)
Total_no_of_Deaths                            1468134716
Cardiovascular Diseases                        447741982
Neoplasms                                      229758538
Chronic Respiratory Diseases                   104605334
Lower Respiratory Infections                    83770038
Neonatal Disorders                              76860729
Diarrheal Diseases                              66235508
Digestive Diseases                              65638635
Tuberculosis                                    45850603
Cirrhosis and Other Chronic Liver Diseases      37479321
HIV/AIDS                                        36364419
Road Injuries                                   36296469
Diabetes Mellitus                               31448872
Alzheimer's Disease and Other Dementias         29768839
Chronic Kidney Disease                          28911692
Malaria                                         25342676
Self-harm                                       23713931
Nutritional Deficiencies                        13792032
Interpersonal Violence                          12752839
Protein-Energy Malnutrition                     12031885
Meningitis                                      10524572
Drowning                                        10301999
Maternal Disorders                               7727046
Parkinson's Disease                              7179795
Alcohol Use Disorders                            4819018
Acute Hepatitis                                  3784791
Fire, Heat, and Hot Substances                   3602914
Conflict and Terrorism                           3294053
Drug Use Disorders                               2656121
Poisonings                                       2601082
Environmental Heat and Cold Exposure             1788851
Exposure to Forces of Nature                     1490132
dtype: int64

China - Top 10 Causes of Deaths¶

In [555]:
# Select China by label instead of relying on it being the first row after sorting
china_10 = Countries_Total_no_of_Deaths_noyear_df.loc[['China']]
In [556]:
china_10.T
Out[556]:
Country/Territory China
Code CHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNC...
Meningitis 480899
Alzheimer's Disease and Other Dementias 5381846
Parkinson's Disease 1533092
Nutritional Deficiencies 584236
Malaria 13418
Drowning 2873619
Interpersonal Violence 776275
Maternal Disorders 243257
HIV/AIDS 433709
Drug Use Disorders 626914
Tuberculosis 2708461
Cardiovascular Diseases 100505973
Lower Respiratory Infections 8525819
Neonatal Disorders 4353666
Alcohol Use Disorders 485796
Self-harm 5078550
Exposure to Forces of Nature 138961
Diarrheal Diseases 886833
Environmental Heat and Cold Exposure 198582
Neoplasms 61060527
Conflict and Terrorism 3043
Diabetes Mellitus 3468554
Chronic Kidney Disease 4195276
Poisonings 770140
Protein-Energy Malnutrition 507664
Road Injuries 8350399
Chronic Respiratory Diseases 36676826
Cirrhosis and Other Chronic Liver Diseases 4918899
Digestive Diseases 8924906
Fire, Heat, and Hot Substances 383402
Acute Hepatitis 318564
Total_no_of_Deaths 265408106
In [557]:
# Access the single row held in china_10 (China)
china_data = china_10.iloc[0]

# Convert all values to numeric; the concatenated 'Code' string becomes NaN
china_data = pd.to_numeric(china_data, errors='coerce')

# Drop the NaN 'Code' entry and the aggregate total, which is not a cause of death
china_data = china_data.dropna().drop('Total_no_of_Deaths')
In [558]:
# Sort the values
sorted_china_data = china_data.sort_values(ascending=False)

# Get top 10 causes
top_10_china = sorted_china_data.head(10)

# Print top 10 causes
print(top_10_china)
Cardiovascular Diseases                      100505973.00
Neoplasms                                     61060527.00
Chronic Respiratory Diseases                  36676826.00
Digestive Diseases                             8924906.00
Lower Respiratory Infections                   8525819.00
Road Injuries                                  8350399.00
Alzheimer's Disease and Other Dementias        5381846.00
Self-harm                                      5078550.00
Cirrhosis and Other Chronic Liver Diseases     4918899.00
Neonatal Disorders                             4353666.00
Name: China, dtype: float64
In [559]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Exclude identifier columns and derived totals so that only causes of death are melted
columns_to_exclude = ['Country/Territory', 'Code', 'Year', 'Rolling_Avg_Total_Deaths', 'Total_no_of_Deaths']
columns_to_plot = [col for col in df.columns if col not in columns_to_exclude]

# Reshape the DataFrame from wide to long format
df_long = df.melt(id_vars=['Year'], value_vars=columns_to_plot, var_name='Cause', value_name='Deaths')

# Ensure data types are correct
df_long['Year'] = df_long['Year'].astype(str)  # Convert Year to string if it's not
df_long['Deaths'] = pd.to_numeric(df_long['Deaths'], errors='coerce')  # Ensure Deaths is numeric

# Plotting example with seaborn
plt.figure(figsize=(12, 8))
sns.lineplot(data=df_long, x='Year', y='Deaths', hue='Cause')
plt.title('Death Causes Excluding Specified Columns')
plt.xlabel('Year')
plt.ylabel('Deaths')
plt.legend(title='Cause')
plt.show()
In [560]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4), dpi=200)
top_10_china.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in China")
plt.show()

India - Top 10 Causes of Deaths¶

In [561]:
# Access the data for India
India_10 = Countries_Total_no_of_Deaths_noyear_df.loc[Countries_Total_no_of_Deaths_noyear_df.index == 'India']
In [562]:
# Drop the 'Total_no_of_Deaths' column and convert the data to numeric
india_data = pd.to_numeric(India_10.iloc[0].drop('Total_no_of_Deaths'), errors='coerce')

# Drop NaN values if they exist
india_data = india_data.dropna()

# Sort the values
sorted_india_data = india_data.sort_values(ascending=False)

# Get top 10 causes
top_10_india = sorted_india_data.head(10)

# Print top 10 causes
print(top_10_india)
Cardiovascular Diseases                      52994710.00
Diarrheal Diseases                           26243547.00
Chronic Respiratory Diseases                 25232974.00
Neonatal Disorders                           20911570.00
Neoplasms                                    17762703.00
Lower Respiratory Infections                 16419404.00
Tuberculosis                                 15820922.00
Digestive Diseases                           11804380.00
Cirrhosis and Other Chronic Liver Diseases    6294910.00
Self-harm                                     5543395.00
Name: India, dtype: float64
In [563]:
# Plot
plt.figure(figsize=(8, 4), dpi=200)
top_10_india.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in India")
plt.show()

United States - Top 10 Causes of Deaths¶

In [564]:
# Access the data for the United States
usa_data = Countries_Total_no_of_Deaths_noyear_df.loc['United States']
In [565]:
# Check how the United States is labeled in the index
print(Countries_Total_no_of_Deaths_noyear_df.index)
Index(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       ...
       'United States', 'United States Virgin Islands', 'Uruguay',
       'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia',
       'Zimbabwe'],
      dtype='object', name='Country/Territory', length=204)
In [566]:
# Drop the 'Total_no_of_Deaths' column if present and convert to numeric
if 'Total_no_of_Deaths' in usa_data.index:
    usa_data = usa_data.drop('Total_no_of_Deaths')

usa_data = pd.to_numeric(usa_data, errors='coerce')

# Drop NaNs if they exist
usa_data = usa_data.dropna()

# Sort the values
sorted_usa_data = usa_data.sort_values(ascending=False)

# Get top 10 causes
top_10_usa = sorted_usa_data.head(10)

# Print top 10 causes
print(top_10_usa)
Cardiovascular Diseases                      26438346.00
Neoplasms                                    18905315.00
Chronic Respiratory Diseases                  4949052.00
Alzheimer's Disease and Other Dementias       3302609.00
Digestive Diseases                            3026943.00
Lower Respiratory Infections                  2248625.00
Diabetes Mellitus                             2030631.00
Chronic Kidney Disease                        2018497.00
Cirrhosis and Other Chronic Liver Diseases    1514325.00
Road Injuries                                 1359744.00
Name: United States, dtype: float64
In [567]:
# Plot
plt.figure(figsize=(8, 4), dpi=200)
top_10_usa.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in the United States")
plt.show()
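
The three rankings can be compared directly with set operations. This sketch hard-codes the cause names from the printed outputs above (for China, the nine highest-ranked causes) and finds those common to all three countries. The set names are illustrative.

```python
# Top causes copied from the printed outputs above
# China: the nine highest-ranked causes
top_china = {
    'Cardiovascular Diseases', 'Neoplasms', 'Chronic Respiratory Diseases',
    'Digestive Diseases', 'Lower Respiratory Infections', 'Road Injuries',
    "Alzheimer's Disease and Other Dementias", 'Self-harm',
    'Cirrhosis and Other Chronic Liver Diseases',
}
top_india = {
    'Cardiovascular Diseases', 'Diarrheal Diseases', 'Chronic Respiratory Diseases',
    'Neonatal Disorders', 'Neoplasms', 'Lower Respiratory Infections',
    'Tuberculosis', 'Digestive Diseases',
    'Cirrhosis and Other Chronic Liver Diseases', 'Self-harm',
}
top_usa = {
    'Cardiovascular Diseases', 'Neoplasms', 'Chronic Respiratory Diseases',
    "Alzheimer's Disease and Other Dementias", 'Digestive Diseases',
    'Lower Respiratory Infections', 'Diabetes Mellitus', 'Chronic Kidney Disease',
    'Cirrhosis and Other Chronic Liver Diseases', 'Road Injuries',
}

# Causes that appear in all three rankings
shared = top_china & top_india & top_usa
# Causes in India's ranking that appear in neither of the other two
india_only = top_india - top_china - top_usa
print(sorted(shared))
print(sorted(india_only))
```

Six causes are shared by all three countries, while diarrheal diseases, neonatal disorders, and tuberculosis appear only in India's ranking.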

Evaluate Trends Over Time¶

In [568]:
# Group data by year and calculate the total number of deaths per year
deaths_per_year = df.groupby('Year')['Total_no_of_Deaths'].sum()

# Plot the trend of deaths over time
plt.figure(figsize=(12, 6))
plt.plot(deaths_per_year, marker='o')
plt.title('Total Deaths Over Time (1990-2019)')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.grid(True)
plt.show()
In [569]:
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare data by grouping by year and summing the numeric columns only
trend_df = df.groupby('Year').sum(numeric_only=True)

# Plot trend for a specific cause of death
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x='Year', y='Cardiovascular Diseases', marker='o')
plt.xlabel("Year")
plt.ylabel("Cardiovascular Diseases Deaths")
plt.title("Trend of Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()

Highlight Increases or Decreases

Calculate the year-over-year percentage change:

In [570]:
# Calculate percentage change
trend_df['Cardiovascular Diseases Change'] = trend_df['Cardiovascular Diseases'].pct_change() * 100

# Plot percentage change
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x=trend_df.index, y='Cardiovascular Diseases Change', marker='o')
plt.xlabel("Year")
plt.ylabel("Percentage Change in Cardiovascular Diseases Deaths")
plt.title("Percentage Change in Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()
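
For reference, `pct_change()` computes (current - previous) / previous for each row, so the first year has no value. The sketch below verifies this on a small hypothetical series. The counts are illustrative only.

```python
import pandas as pd

# Hypothetical yearly death counts
deaths = pd.Series([200, 210, 189], index=[2000, 2001, 2002])

# Manual formula: (current - previous) / previous, expressed in percent
previous = deaths.shift(1)
manual = (deaths - previous) / previous * 100

# Built-in equivalent
builtin = deaths.pct_change() * 100
print(builtin)
```

Here the count rises by 5% from 2000 to 2001 and falls by 10% from 2001 to 2002, and the manual and built-in results agree.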

Identify Top Causes of Death Globally and Regionally

In [571]:
# Global top causes of death ('Total_no_of_Deaths' is an aggregate, not a cause, so it is dropped)
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year', 'Total_no_of_Deaths']).sum().sort_values(ascending=False)
top_global_causes = total_deaths.head(10)

plt.figure(figsize=(12, 8), dpi=200)
top_global_causes.plot(kind='barh')
plt.xlabel("Total Number of Deaths")
plt.ylabel("Causes of Death")
plt.title("Top 10 Causes of Deaths Globally")
plt.show()

Advanced Visualizations

In [572]:
# Trend visualization with Seaborn
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x=trend_df.index, y='Cardiovascular Diseases', marker='o', color='b')
plt.xlabel("Year")
plt.ylabel("Cardiovascular Diseases Deaths")
plt.title("Trend of Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()

# Heatmap for global causes of death
plt.figure(figsize=(14, 10), dpi=200)
sns.heatmap(df.drop(columns=['Country/Territory', 'Code', 'Year']).corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Causes of Death")
plt.show()
In [573]:
# Trend Analysis:

# Analyze trends over time to identify increases or decreases.

import matplotlib.pyplot as plt

# Plot trends for a specific cause (e.g., Cardiovascular Diseases)
df.groupby('Year')['Cardiovascular Diseases'].sum().plot()
plt.title('Trends in Cardiovascular Diseases')
plt.xlabel('Year')
plt.ylabel('Number of Deaths')
plt.show()

Regional Analysis:

Compare the top causes of death across different countries.
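
One compact way to make this comparison is to find the leading cause for every country at once with `idxmax(axis=1)`. The sketch below runs on a hypothetical table of per-country totals. The country labels `A`, `B`, `C` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-country totals for three causes (illustrative values only)
toy_totals = pd.DataFrame(
    {
        'Cardiovascular Diseases': [500, 300, 90],
        'Neoplasms': [200, 350, 40],
        'Malaria': [0, 10, 120],
    },
    index=pd.Index(['A', 'B', 'C'], name='Country/Territory'),
)

# idxmax(axis=1) returns, for each row, the column label with the largest value
leading_cause = toy_totals.idxmax(axis=1)
print(leading_cause)
```

Applied to the notebook's `country_cause_deaths` table, `country_cause_deaths.idxmax(axis=1).value_counts()` would show how many countries share each leading cause.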

In [574]:
# Calculate the average number of deaths by country
# Ensure all relevant columns are numeric
for col in df.columns[3:]:  # Adjust based on your actual columns
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Keep only numeric columns for aggregation
df_numeric = df.select_dtypes(include=['number'])

# Calculate the average number of deaths by country
average_deaths_by_country = df_numeric.groupby(df['Country/Territory']).mean()

# Display the top countries for a specific cause, e.g., 'Cardiovascular Diseases'
top_countries = average_deaths_by_country['Cardiovascular Diseases'].sort_values(ascending=False)
print(top_countries.head(10))
Country/Territory
China           3350199.10
India           1766490.33
Russia          1130126.03
United States    881278.20
Indonesia        452900.37
Ukraine          435101.73
Germany          360659.00
Brazil           319633.97
Japan            307014.57
Pakistan         258173.07
Name: Cardiovascular Diseases, dtype: float64
In [575]:
# improve readability
pd.options.display.float_format = '{:.2f}'.format
print(top_countries.head(10))
Country/Territory
China           3350199.10
India           1766490.33
Russia          1130126.03
United States    881278.20
Indonesia        452900.37
Ukraine          435101.73
Germany          360659.00
Brazil           319633.97
Japan            307014.57
Pakistan         258173.07
Name: Cardiovascular Diseases, dtype: float64
In [576]:
df.head()
Out[576]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
0 Afghanistan AFG 1990 2159 1116 371 2087 93 1370 1538 ... 3709 338 2054 4154 5945 2673 5005 323 2985 147971
1 Afghanistan AFG 1991 2218 1136 374 2153 189 1391 2001 ... 3724 351 2119 4472 6050 2728 5120 332 3092 156844
2 Afghanistan AFG 1992 2475 1162 378 2441 239 1514 2299 ... 3776 386 2404 5106 6223 2830 5335 360 3325 169156
3 Afghanistan AFG 1993 2812 1187 384 2837 108 1687 2589 ... 3862 425 2797 5681 6445 2943 5568 396 3601 182230
4 Afghanistan AFG 1994 3027 1211 391 3081 211 1809 2849 ... 3932 451 3038 6001 6664 3027 5739 420 3816 194795

5 rows × 35 columns

Sample Data for Regional Comparisons

In [577]:
# Sample data: Top 10 countries with the highest number of deaths from Cardiovascular Diseases
top_countries = df.groupby('Country/Territory')['Cardiovascular Diseases'].sum().nlargest(10)

# Set figure size
plt.figure(figsize=(12, 8))

# Create a bar plot
sns.barplot(x=top_countries.index, y=top_countries.values, palette='viridis')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add titles and labels
plt.title('Top 10 Countries/Territories by Cardiovascular Disease Deaths', fontsize=16)
plt.xlabel('Country/Territory', fontsize=14)
plt.ylabel('Number of Deaths', fontsize=14)

# Improve layout and avoid clipping
plt.tight_layout()

# Show plot
plt.show()

Statistical Analysis

Correlation Analysis:

In [578]:
# Select only numerical columns for correlation analysis
numerical_cols = df.select_dtypes(include=['number']).columns
df_numerical = df[numerical_cols]
# Calculate correlation matrix
correlation_matrix = df_numerical.corr()
In [579]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
In [580]:
df.columns
Out[580]:
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
       'Total_no_of_Deaths'],
      dtype='object')
In [581]:
# Define the columns to include in the sample correlation matrix
sample_columns = [
    'Meningitis', 
    'Nutritional Deficiencies', 
    'Alzheimer\'s Disease and Other Dementias', 
    'Parkinson\'s Disease', 
    'Malaria', 
    'Drowning', 
    'Interpersonal Violence'
]


# Filter the DataFrame to only include these columns
df_sample = df[sample_columns]

# Calculate the correlation matrix for the sample
correlation_matrix_sample = df_sample.corr()

# Plot the heatmap of the sample correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_sample, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Sample Correlation Matrix of Selected Features')
plt.show()
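
A heatmap shows every pairwise correlation at once, which can be hard to scan. As a complement, the strongest pairs can be listed directly from the correlation matrix. This is a minimal sketch on toy data; the column names `a`, `b`, `c` are placeholders, and in the notebook `correlation_matrix_sample` would take the place of `corr`.

```python
import numpy as np
import pandas as pd

# Toy data; column names 'a', 'b', 'c' are placeholders, not dataset columns.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 4, 1, 2],
})
corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.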
In [582]:
# Total deaths for three selected causes (a small sample, not a ranking of all causes)
top_causes = df[['Meningitis', 'Nutritional Deficiencies', 'Alzheimer\'s Disease and Other Dementias']].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
top_causes.plot(kind='bar', color='teal')
plt.xlabel('Cause of Death')
plt.ylabel('Total Deaths')
plt.title('Total Deaths for Selected Causes')
plt.show()
In [583]:
# Interactive Visualization
import plotly.express as px

# Top causes of death
top_causes = total_deaths.head(10).reset_index()
top_causes.columns = ['Cause', 'Total Deaths']

fig = px.bar(top_causes, x='Total Deaths', y='Cause', title='Top 10 Causes of Death Globally', orientation='h')
fig.show()
In [584]:
diseases = ['Meningitis', 
            'Alzheimer\'s Disease and Other Dementias',
            'Parkinson\'s Disease', 
            'Nutritional Deficiencies', 
            'Malaria', 
            'Drowning', 
            'Interpersonal Violence', 
            'Maternal Disorders', 
            'HIV/AIDS', 
            'Drug Use Disorders', 
            'Tuberculosis', 
            'Cardiovascular Diseases', 
            'Lower Respiratory Infections', 
            'Neonatal Disorders', 
            'Alcohol Use Disorders', 
            'Self-harm', 
            'Exposure to Forces of Nature', 
            'Diarrheal Diseases', 
            'Environmental Heat and Cold Exposure', 
            'Neoplasms', 
            'Conflict and Terrorism', 
            'Diabetes Mellitus', 
            'Chronic Kidney Disease', 
            'Poisonings', 
            'Protein-Energy Malnutrition', 
            'Road Injuries', 
            'Chronic Respiratory Diseases', 
            'Cirrhosis and Other Chronic Liver Diseases', 
            'Digestive Diseases', 
            'Fire, Heat, and Hot Substances', 
            'Acute Hepatitis']

for x in diseases:
    # Top 10 countries by total deaths from this cause, summed over all years
    data = df.groupby(['Country/Territory'])[x].sum().sort_values(ascending=False)[:10]
    plt.figure(figsize=(12,6))
    plt.bar(x=data.index, height=data.values, width=0.9,
            color=['crimson', 'blue', 'green', 'yellow', 'magenta'])
    plt.xticks(rotation='vertical')
    plt.xlabel("COUNTRIES", size=10)
    # The values are raw death counts, not a per-million rate
    plt.ylabel(x.upper() + ' DEATHS')
    plt.title("COUNTRIES WITH HIGHEST " + x.upper() + ' DEATHS')
    plt.show()

Trend Analysis

Trends Over Time:

Time series of total number of deaths around the world

In [585]:
# Find the total number of deaths group by year

Deaths_by_year = df.groupby('Year')['Total_no_of_Deaths'].sum().reset_index()

Deaths_by_year
Out[585]:
Year Total_no_of_Deaths
0 1990 43518516
1 1991 44059729
2 1992 44459130
3 1993 45185713
4 1994 46182613
5 1995 46177018
6 1996 46320827
7 1997 46672370
8 1998 47066088
9 1999 47652090
10 2000 48050317
11 2001 48385692
12 2002 48897031
13 2003 49123952
14 2004 49330171
15 2005 49591909
16 2006 49424521
17 2007 49495216
18 2008 50115740
19 2009 49900666
20 2010 50422775
21 2011 50413303
22 2012 50597654
23 2013 50931550
24 2014 51268375
25 2015 51856393
26 2016 52337435
27 2017 52789758
28 2018 53545244
29 2019 54362920
In [586]:
# Create line chart 

plt.figure(figsize=(12,8))

sns.lineplot(data = Deaths_by_year, x = 'Year', y = 'Total_no_of_Deaths')

plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time series of total number of deaths around the world', fontsize=15)
Out[586]:
Text(0.5, 1.0, 'Time series of total number of deaths around the world')
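
To summarize the line chart in a single figure, the first and last rows of the `Deaths_by_year` table can be compared directly. The sketch below copies the 1990 and 2019 totals from that table; the variable names are only illustrative.

```python
# Totals copied from the Deaths_by_year table above.
deaths_1990 = 43_518_516
deaths_2019 = 54_362_920

absolute_increase = deaths_2019 - deaths_1990
percent_increase = absolute_increase / deaths_1990 * 100

print(f"Absolute increase: {absolute_increase:,}")
print(f"Relative increase: {percent_increase:.1f}%")
```

On these figures the recorded total rises by roughly a quarter over the 30-year span.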

Time series comparing the total number of deaths across the top 10 countries

In [587]:
df.columns
Out[587]:
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
       'Total_no_of_Deaths'],
      dtype='object')
In [588]:
# Create line chart to Compare the Total Number of Deaths Between Top 10 Countries

plt.figure(figsize=(12,8))

for i in Top10_countries['Country/Territory']:
    a= df[df['Country/Territory']==i]
    sns.lineplot(data=a, x='Year', y='Total_no_of_Deaths',label=i)
    
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time series compare the total number of deaths between top 10 countries', fontsize=15)
Out[588]:
Text(0.5, 1.0, 'Time series compare the total number of deaths between top 10 countries')
In [589]:
import matplotlib.pyplot as plt

# Group by year and calculate the mean deaths for a specific cause, e.g., 'Cardiovascular Diseases'
trend_cardio = df.groupby('Year')['Cardiovascular Diseases'].mean()

# Plot the trend over time
plt.figure(figsize=(12, 6))
plt.plot(trend_cardio.index, trend_cardio.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Average Deaths')
plt.title('Trend of Cardiovascular Diseases Over Time')
plt.grid(True)
plt.show()

Visualizing Top Causes of Death

In [590]:
import seaborn as sns

# Calculate total deaths for each cause (exclude the aggregate total so it is not ranked as a cause)
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year', 'Total_no_of_Deaths']).sum().sort_values(ascending=False)

# Create a DataFrame for top causes of death
top_causes = total_deaths.head(10).reset_index()
top_causes.columns = ['Cause', 'Total Deaths']

# Plot
plt.figure(figsize=(12, 8))
sns.barplot(x='Total Deaths', y='Cause', data=top_causes, palette='viridis')
plt.xlabel('Total Deaths')
plt.title('Top 10 Causes of Death Globally')
plt.show()

Analyze Outlier Distribution:

In [591]:
plt.figure(figsize=(15,10))
sns.boxplot(data=df.drop(columns=['Country/Territory', 'Code', 'Year']))
plt.xticks(rotation=90)
plt.show()
In [592]:
# Visualize distributions of key features
df.hist(bins=30, figsize=(20,15))
plt.show()
In [593]:
for column in df.select_dtypes(include=[np.number]).columns:
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()
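
The boxplots mark outliers visually. To count or inspect them, the same 1.5 × IQR rule that boxplot whiskers use by default can be applied by hand. This is a minimal sketch on a made-up series, not on the dataset itself.

```python
import pandas as pd

# Toy series; the values are made up to include one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95], name='toy_cause')

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

For death counts, a flagged value is often just a very populous country rather than a data error, so flagged rows are worth inspecting before removing anything.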
In [594]:
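# List of African countries (this can be expanded or adjusted as needed).
# isin() matches names exactly, so each entry must use the dataset's own spelling;
# e.g. the dataset has 'Democratic Republic of Congo', which 'DR Congo' below does not match.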
african_countries = [
    'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi', 
    'Cabo Verde', 'Cameroon', 'Central African Republic', 'Chad', 'Comoros', 
    'Congo', 'Cote d\'Ivoire', 'Djibouti', 'DR Congo', 'Egypt', 'Equatorial Guinea',
    'Eritrea', 'Eswatini', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Guinea', 
    'Guinea-Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 
    'Mali', 'Mauritania', 'Mauritius', 'Morocco', 'Mozambique', 'Namibia', 
    'Niger', 'Nigeria', 'Rwanda', 'Sao Tome and Principe', 'Senegal', 'Seychelles', 
    'Sierra Leone', 'Somalia', 'South Africa', 'South Sudan', 'Sudan', 'Tanzania', 
    'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe'
]

# Filter out African countries
non_african_malaria = df[~df['Country/Territory'].isin(african_countries)]
malaria_non_african = non_african_malaria.groupby("Country/Territory")["Malaria"].sum().sort_values(ascending=False).head(10)

malaria_non_african
Out[594]:
Country/Territory
Democratic Republic of Congo    2557219
India                           2439244
Bangladesh                       349375
Pakistan                         213590
Myanmar                          157143
Yemen                            143463
Indonesia                         74664
Brazil                            39970
Papua New Guinea                  35997
Haiti                             28833
Name: Malaria, dtype: int64
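
The result above still lists `Democratic Republic of Congo`, which suggests that the `'DR Congo'` entry in the list does not match the dataset's spelling. One way to catch such mismatches is to compare the list against the names actually present in the data. The sketch below uses toy stand-ins (`toy_df`, `toy_african`) for the notebook's `df` and `african_countries`.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `african_countries`.
toy_df = pd.DataFrame({
    'Country/Territory': ['Democratic Republic of Congo', 'India', 'Kenya'],
    'Malaria': [100, 90, 80],
})
toy_african = ['DR Congo', 'Kenya']

# Names in the list that never occur in the data: likely spelling mismatches.
present = set(toy_df['Country/Territory'].unique())
unmatched = sorted(set(toy_african) - present)

print(unmatched)
```

Any name reported here is silently ignored by `isin()`, so it should be corrected to the dataset's spelling before filtering.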

Feature Engineering

Create New Features:

Yearly Trends: Create features that capture how deaths change from one year to the next within each country.

In [595]:
# Example: Change in Meningitis deaths from the previous year (0 for each country's first year)
df['meningitis_change'] = df.groupby('Country/Territory')['Meningitis'].diff().fillna(0)

Cumulative Totals: A running sum of deaths per country captures how the total accumulates over the years.

In [596]:
# Example: Cumulative sum of deaths
df['Cumulative_Deaths'] = df.groupby('Country/Territory')['Total_no_of_Deaths'].cumsum()
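
Both features rely on the rows of each country being in chronological order. The following sketch shows the two operations on a tiny made-up frame, with an explicit sort added as a precaution; the column `cumulative_meningitis` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the notebook.
toy = pd.DataFrame({
    'Country/Territory': ['A', 'A', 'A', 'B', 'B'],
    'Year': [1990, 1991, 1992, 1990, 1991],
    'Meningitis': [10, 15, 12, 100, 90],
})

# Sort first so that diff() and cumsum() follow chronological order per country.
toy = toy.sort_values(['Country/Territory', 'Year'])
grouped = toy.groupby('Country/Territory')['Meningitis']

toy['meningitis_change'] = grouped.diff().fillna(0)  # first year per country becomes 0
toy['cumulative_meningitis'] = grouped.cumsum()

print(toy)
```

Because the operations run per group, the first year of country `B` does not inherit a difference or a running total from country `A`.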

Data Cleaning

In [597]:
# Check data types
df.dtypes
Out[597]:
Country/Territory                              object
Code                                           object
Year                                            int64
Meningitis                                      int64
Alzheimer's Disease and Other Dementias         int64
Parkinson's Disease                             int64
Nutritional Deficiencies                        int64
Malaria                                         int64
Drowning                                        int64
Interpersonal Violence                          int64
Maternal Disorders                              int64
HIV/AIDS                                        int64
Drug Use Disorders                              int64
Tuberculosis                                    int64
Cardiovascular Diseases                         int64
Lower Respiratory Infections                    int64
Neonatal Disorders                              int64
Alcohol Use Disorders                           int64
Self-harm                                       int64
Exposure to Forces of Nature                    int64
Diarrheal Diseases                              int64
Environmental Heat and Cold Exposure            int64
Neoplasms                                       int64
Conflict and Terrorism                          int64
Diabetes Mellitus                               int64
Chronic Kidney Disease                          int64
Poisonings                                      int64
Protein-Energy Malnutrition                     int64
Road Injuries                                   int64
Chronic Respiratory Diseases                    int64
Cirrhosis and Other Chronic Liver Diseases      int64
Digestive Diseases                              int64
Fire, Heat, and Hot Substances                  int64
Acute Hepatitis                                 int64
Total_no_of_Deaths                              int64
meningitis_change                             float64
Cumulative_Deaths                               int64
dtype: object
In [598]:
df['Year'] = df['Year'].astype(int)
In [599]:
df.dtypes
Out[599]:
Country/Territory                              object
Code                                           object
Year                                            int32
Meningitis                                      int64
Alzheimer's Disease and Other Dementias         int64
Parkinson's Disease                             int64
Nutritional Deficiencies                        int64
Malaria                                         int64
Drowning                                        int64
Interpersonal Violence                          int64
Maternal Disorders                              int64
HIV/AIDS                                        int64
Drug Use Disorders                              int64
Tuberculosis                                    int64
Cardiovascular Diseases                         int64
Lower Respiratory Infections                    int64
Neonatal Disorders                              int64
Alcohol Use Disorders                           int64
Self-harm                                       int64
Exposure to Forces of Nature                    int64
Diarrheal Diseases                              int64
Environmental Heat and Cold Exposure            int64
Neoplasms                                       int64
Conflict and Terrorism                          int64
Diabetes Mellitus                               int64
Chronic Kidney Disease                          int64
Poisonings                                      int64
Protein-Energy Malnutrition                     int64
Road Injuries                                   int64
Chronic Respiratory Diseases                    int64
Cirrhosis and Other Chronic Liver Diseases      int64
Digestive Diseases                              int64
Fire, Heat, and Hot Substances                  int64
Acute Hepatitis                                 int64
Total_no_of_Deaths                              int64
meningitis_change                             float64
Cumulative_Deaths                               int64
dtype: object
In [600]:
df.drop(['Country/Territory', 'Code'], axis=1, inplace=True)

Renaming columns to a cleaner format in the original DataFrame

In [601]:
# 1. Renaming columns to a cleaner format
df.columns = df.columns.str.lower().str.replace("'", "").str.replace(" ", "_")
In [602]:
df.head()
Out[602]:
year meningitis alzheimers_disease_and_other_dementias parkinsons_disease nutritional_deficiencies malaria drowning interpersonal_violence maternal_disorders hiv/aids ... protein-energy_malnutrition road_injuries chronic_respiratory_diseases cirrhosis_and_other_chronic_liver_diseases digestive_diseases fire,_heat,_and_hot_substances acute_hepatitis total_no_of_deaths meningitis_change cumulative_deaths
0 1990 2159 1116 371 2087 93 1370 1538 2655 34 ... 2054 4154 5945 2673 5005 323 2985 147971 0.00 147971
1 1991 2218 1136 374 2153 189 1391 2001 2885 41 ... 2119 4472 6050 2728 5120 332 3092 156844 59.00 304815
2 1992 2475 1162 378 2441 239 1514 2299 3315 48 ... 2404 5106 6223 2830 5335 360 3325 169156 257.00 473971
3 1993 2812 1187 384 2837 108 1687 2589 3671 56 ... 2797 5681 6445 2943 5568 396 3601 182230 337.00 656201
4 1994 3027 1211 391 3081 211 1809 2849 3863 63 ... 3038 6001 6664 3027 5739 420 3816 194795 215.00 850996

5 rows × 35 columns
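
The renamed columns still contain characters such as `/`, `,` and `-` (for example `hiv/aids` and `fire,_heat,_and_hot_substances`). If plain identifiers are preferred, a stricter normalization could look like the sketch below. `to_snake_case` is a hypothetical helper introduced here for illustration; it is not part of the notebook or of pandas.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a label and keep only [a-z0-9_]."""
    name = name.lower().replace("'", "")      # drop apostrophes: alzheimer's -> alzheimers
    name = re.sub(r"[^a-z0-9]+", "_", name)   # any other run of symbols becomes one underscore
    return name.strip("_")

labels = [
    "HIV/AIDS",
    "Fire, Heat, and Hot Substances",
    "Alzheimer's Disease and Other Dementias",
    "Self-harm",
    "Total_no_of_Deaths",
]
for label in labels:
    print(label, "->", to_snake_case(label))
```

It could be applied with `df.columns = [to_snake_case(c) for c in df.columns]`. Names that later cells rely on, such as `total_no_of_deaths`, come out unchanged.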

In [603]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 35 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   year                                        6120 non-null   int32  
 1   meningitis                                  6120 non-null   int64  
 2   alzheimers_disease_and_other_dementias      6120 non-null   int64  
 3   parkinsons_disease                          6120 non-null   int64  
 4   nutritional_deficiencies                    6120 non-null   int64  
 5   malaria                                     6120 non-null   int64  
 6   drowning                                    6120 non-null   int64  
 7   interpersonal_violence                      6120 non-null   int64  
 8   maternal_disorders                          6120 non-null   int64  
 9   hiv/aids                                    6120 non-null   int64  
 10  drug_use_disorders                          6120 non-null   int64  
 11  tuberculosis                                6120 non-null   int64  
 12  cardiovascular_diseases                     6120 non-null   int64  
 13  lower_respiratory_infections                6120 non-null   int64  
 14  neonatal_disorders                          6120 non-null   int64  
 15  alcohol_use_disorders                       6120 non-null   int64  
 16  self-harm                                   6120 non-null   int64  
 17  exposure_to_forces_of_nature                6120 non-null   int64  
 18  diarrheal_diseases                          6120 non-null   int64  
 19  environmental_heat_and_cold_exposure        6120 non-null   int64  
 20  neoplasms                                   6120 non-null   int64  
 21  conflict_and_terrorism                      6120 non-null   int64  
 22  diabetes_mellitus                           6120 non-null   int64  
 23  chronic_kidney_disease                      6120 non-null   int64  
 24  poisonings                                  6120 non-null   int64  
 25  protein-energy_malnutrition                 6120 non-null   int64  
 26  road_injuries                               6120 non-null   int64  
 27  chronic_respiratory_diseases                6120 non-null   int64  
 28  cirrhosis_and_other_chronic_liver_diseases  6120 non-null   int64  
 29  digestive_diseases                          6120 non-null   int64  
 30  fire,_heat,_and_hot_substances              6120 non-null   int64  
 31  acute_hepatitis                             6120 non-null   int64  
 32  total_no_of_deaths                          6120 non-null   int64  
 33  meningitis_change                           6120 non-null   float64
 34  cumulative_deaths                           6120 non-null   int64  
dtypes: float64(1), int32(1), int64(33)
memory usage: 1.6 MB
In [604]:
# Convert 'meningitis_change' from float64 to an integer dtype (int32 on this platform, as the output shows)
df['meningitis_change'] = df['meningitis_change'].astype(int)

# Verify the change
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 35 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   year                                        6120 non-null   int32
 1   meningitis                                  6120 non-null   int64
 2   alzheimers_disease_and_other_dementias      6120 non-null   int64
 3   parkinsons_disease                          6120 non-null   int64
 4   nutritional_deficiencies                    6120 non-null   int64
 5   malaria                                     6120 non-null   int64
 6   drowning                                    6120 non-null   int64
 7   interpersonal_violence                      6120 non-null   int64
 8   maternal_disorders                          6120 non-null   int64
 9   hiv/aids                                    6120 non-null   int64
 10  drug_use_disorders                          6120 non-null   int64
 11  tuberculosis                                6120 non-null   int64
 12  cardiovascular_diseases                     6120 non-null   int64
 13  lower_respiratory_infections                6120 non-null   int64
 14  neonatal_disorders                          6120 non-null   int64
 15  alcohol_use_disorders                       6120 non-null   int64
 16  self-harm                                   6120 non-null   int64
 17  exposure_to_forces_of_nature                6120 non-null   int64
 18  diarrheal_diseases                          6120 non-null   int64
 19  environmental_heat_and_cold_exposure        6120 non-null   int64
 20  neoplasms                                   6120 non-null   int64
 21  conflict_and_terrorism                      6120 non-null   int64
 22  diabetes_mellitus                           6120 non-null   int64
 23  chronic_kidney_disease                      6120 non-null   int64
 24  poisonings                                  6120 non-null   int64
 25  protein-energy_malnutrition                 6120 non-null   int64
 26  road_injuries                               6120 non-null   int64
 27  chronic_respiratory_diseases                6120 non-null   int64
 28  cirrhosis_and_other_chronic_liver_diseases  6120 non-null   int64
 29  digestive_diseases                          6120 non-null   int64
 30  fire,_heat,_and_hot_substances              6120 non-null   int64
 31  acute_hepatitis                             6120 non-null   int64
 32  total_no_of_deaths                          6120 non-null   int64
 33  meningitis_change                           6120 non-null   int32
 34  cumulative_deaths                           6120 non-null   int64
dtypes: int32(2), int64(33)
memory usage: 1.6 MB
None
In [613]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [614]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the model
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train, y_train)

# Make predictions
y_pred_lin_reg = lin_reg_model.predict(X_test)

# Evaluate the model
mse_lin_reg = mean_squared_error(y_test, y_pred_lin_reg)
r2_lin_reg = r2_score(y_test, y_pred_lin_reg)

print(f"Linear Regression Mean Squared Error: {mse_lin_reg}")
print(f"Linear Regression R-squared: {r2_lin_reg}")
Linear Regression Mean Squared Error: 2.0415135847184325e-18
Linear Regression R-squared: 1.0
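
An R-squared of exactly 1.0 with an error near machine precision is what one would expect when the target is an exact linear combination of the features. Here `total_no_of_deaths` is presumably the sum of the individual cause columns that remain in `X`, so the model only has to learn a coefficient of 1 per column. The sketch below reproduces that effect on synthetic data (random values, not the dataset); `X_toy` and `y_toy` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic "cause" columns; the target is defined as their exact row-wise sum.
X_toy = rng.integers(0, 1000, size=(200, 5)).astype(float)
y_toy = X_toy.sum(axis=1)

model = LinearRegression().fit(X_toy, y_toy)
r2 = r2_score(y_toy, model.predict(X_toy))

print("R-squared:", r2)
print("Coefficients:", model.coef_.round(6))  # each is close to 1
```

A score like this says little about predictive skill. A more informative setup would predict the total from information that does not already contain it, for example from earlier years.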
In [616]:
from sklearn.linear_model import Ridge

# Initialize Ridge Regression
ridge_model = Ridge(alpha=1.0)  # Adjust alpha to control regularization strength
ridge_model.fit(X_train, y_train)

# Make predictions
y_pred_ridge = ridge_model.predict(X_test)

# Evaluate the model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f"Ridge Regression Mean Squared Error: {mse_ridge}")
print(f"Ridge Regression R-squared: {r2_ridge}")
Ridge Regression Mean Squared Error: 637395.7524604569
Ridge Regression R-squared: 0.9999989133618953
In [617]:
alphas = [0.1, 1.0, 10.0, 100.0]
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    y_pred_ridge = ridge_model.predict(X_test)
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    r2_ridge = r2_score(y_test, y_pred_ridge)
    print(f"Alpha: {alpha}")
    print(f"Ridge Regression Mean Squared Error: {mse_ridge}")
    print(f"Ridge Regression R-squared: {r2_ridge}")
Alpha: 0.1
Ridge Regression Mean Squared Error: 7541.972988469264
Ridge Regression R-squared: 0.9999999871423755
Alpha: 1.0
Ridge Regression Mean Squared Error: 637395.7524604569
Ridge Regression R-squared: 0.9999989133618953
Alpha: 10.0
Ridge Regression Mean Squared Error: 26572135.303873844
Ridge Regression R-squared: 0.9999546995808584
Alpha: 100.0
Ridge Regression Mean Squared Error: 308119194.9415028
Ridge Regression R-squared: 0.999474715580181
In [618]:
from sklearn.linear_model import Lasso

# Initialize Lasso Regression
lasso_model = Lasso(alpha=1.0)  # Adjust alpha as needed
lasso_model.fit(X_train, y_train)

# Make predictions
y_pred_lasso = lasso_model.predict(X_test)

# Evaluate the model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f"Lasso Regression Mean Squared Error: {mse_lasso}")
print(f"Lasso Regression R-squared: {r2_lasso}")
Lasso Regression Mean Squared Error: 271204822.12131757
Lasso Regression R-squared: 0.999537647540371
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.692e+11, tolerance: 3.953e+11

In [619]:
from sklearn.ensemble import RandomForestRegressor

# Initialize Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
In [620]:
from sklearn.model_selection import cross_val_score

# Cross-validation with Ridge Regression
cv_scores_ridge = cross_val_score(ridge_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Ridge): {-cv_scores_ridge.mean()}")

# Cross-validation with Lasso Regression
cv_scores_lasso = cross_val_score(lasso_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Lasso): {-cv_scores_lasso.mean()}")
Cross-Validated Mean Squared Error (Ridge): 3.5722394712611494e-07
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.501e+11, tolerance: 2.362e+11

D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.994e+11, tolerance: 3.809e+11

D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.879e+11, tolerance: 4.456e+11

Cross-Validated Mean Squared Error (Lasso): 3183700565.571262
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.747e+11, tolerance: 4.485e+11
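
The cross-validation call above receives the unscaled `X`, whereas the earlier models were fitted on standardized features, so these scores are not directly comparable with the earlier ones. The warnings also suggest that Lasso struggles to converge at this feature scale. One common pattern is to wrap scaling and model in a pipeline, so that each fold is scaled using only its own training part. The sketch below shows the idea on synthetic data from `make_regression`; the `alpha` and `max_iter` values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X and y.
X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=42)

models = {
    'ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'lasso': make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10000)),
}

cv_mse = {}
for name, pipe in models.items():
    # The scaler is re-fitted inside every fold, on that fold's training split only.
    scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()
    print(f"{name}: cross-validated MSE = {cv_mse[name]:.2f}")
```

In the notebook, the same pipelines could be passed to `cross_val_score` together with the raw `X` and `y`.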

In [621]:
# Hyperparameter tuning
# ridge reg
from sklearn.model_selection import GridSearchCV

# Define parameter grid
parameters_ridge = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_model = Ridge()

# Perform grid search
grid_search_ridge = GridSearchCV(ridge_model, parameters_ridge, scoring='neg_mean_squared_error', cv=5)
grid_search_ridge.fit(X_train, y_train)

# Best parameters and score
print(f"Best Parameters (Ridge): {grid_search_ridge.best_params_}")
print(f"Best Score (Ridge): {-grid_search_ridge.best_score_}")
Best Parameters (Ridge): {'alpha': 0.01}
Best Score (Ridge): 149.56717271873725
In [622]:
# lasso reg
# Define parameter grid for Lasso
parameters_lasso = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
lasso_model = Lasso(max_iter=10000)

# Perform grid search
grid_search_lasso = GridSearchCV(lasso_model, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
grid_search_lasso.fit(X_train, y_train)

# Best parameters and score
print(f"Best Parameters (Lasso): {grid_search_lasso.best_params_}")
print(f"Best Score (Lasso): {-grid_search_lasso.best_score_}")
Best Parameters (Lasso): {'alpha': 0.01}
Best Score (Lasso): 2120883.2258921517
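
Mean squared errors are expressed in squared deaths, which makes them hard to read. Taking the square root gives the RMSE in the same unit as the target. The sketch below converts the two best scores printed by the grid searches above; the dictionary names are only illustrative.

```python
import math

# Best cross-validated MSE values printed by the two grid searches above.
best_mse = {'ridge': 149.56717271873725, 'lasso': 2120883.2258921517}

# RMSE is the square root of MSE and has the same unit as the target (deaths).
best_rmse = {name: math.sqrt(mse) for name, mse in best_mse.items()}

for name, rmse in best_rmse.items():
    print(f"{name}: RMSE = {rmse:.1f}")
```

Both values are small compared with yearly totals in the tens of thousands or more, which is consistent with the near-perfect R-squared scores reported earlier.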
In [625]:
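from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Define features and target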
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create preprocessing pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)

# Apply preprocessing to both training and testing data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Initialize models
ridge_model = Ridge()
lasso_model = Lasso(max_iter=10000)
rf_model = RandomForestRegressor()
gb_model = GradientBoostingRegressor()

# Define parameter grids for GridSearchCV
parameters_ridge = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
parameters_lasso = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# Perform grid search for Ridge and Lasso
grid_search_ridge = GridSearchCV(ridge_model, parameters_ridge, scoring='neg_mean_squared_error', cv=5)
grid_search_ridge.fit(X_train_preprocessed, y_train)
best_ridge_model = grid_search_ridge.best_estimator_

grid_search_lasso = GridSearchCV(lasso_model, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
grid_search_lasso.fit(X_train_preprocessed, y_train)
best_lasso_model = grid_search_lasso.best_estimator_

# Train Random Forest and Gradient Boosting models
rf_model.fit(X_train_preprocessed, y_train)
gb_model.fit(X_train_preprocessed, y_train)

# Make predictions
y_pred_ridge = best_ridge_model.predict(X_test_preprocessed)
y_pred_lasso = best_lasso_model.predict(X_test_preprocessed)
y_pred_rf = rf_model.predict(X_test_preprocessed)
y_pred_gb = gb_model.predict(X_test_preprocessed)

# Evaluate models
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} Mean Squared Error: {mse}")
    print(f"{model_name} R-squared: {r2}")

evaluate_model(y_test, y_pred_ridge, "Ridge Regression")
evaluate_model(y_test, y_pred_lasso, "Lasso Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest")
evaluate_model(y_test, y_pred_gb, "Gradient Boosting")

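# Cross-validation scores on the full dataset
# Note: the preprocessor was fit on X_train only, and cv=5 uses unshuffled folds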
cv_scores_ridge = cross_val_score(best_ridge_model, preprocessor.transform(X), y, cv=5, scoring='neg_mean_squared_error')
cv_scores_lasso = cross_val_score(best_lasso_model, preprocessor.transform(X), y, cv=5, scoring='neg_mean_squared_error')

print(f"Cross-Validated Mean Squared Error (Ridge): {-cv_scores_ridge.mean()}")
print(f"Cross-Validated Mean Squared Error (Lasso): {-cv_scores_lasso.mean()}")
Ridge Regression Mean Squared Error: 75.21693165381222
Ridge Regression R-squared: 0.9999999998717695
Lasso Regression Mean Squared Error: 2109596.3646646277
Lasso Regression R-squared: 0.9999964035408353
Random Forest Mean Squared Error: 412402826.7590321
Random Forest R-squared: 0.9992969318914813
Gradient Boosting Mean Squared Error: 607484604.8760853
Gradient Boosting R-squared: 0.9989643546930536
Cross-Validated Mean Squared Error (Ridge): 2917.6510757034894
Cross-Validated Mean Squared Error (Lasso): 24126655.184786927
In [626]:
# advanced techniques
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train Gradient Boosting Regressor
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train_preprocessed, y_train)

# Make predictions
y_pred_gb = gb_model.predict(X_test_preprocessed)

# Evaluate the Gradient Boosting model
evaluate_model(y_test, y_pred_gb, "Gradient Boosting")
Gradient Boosting Mean Squared Error: 646083812.675353
Gradient Boosting R-squared: 0.9988985504107915
  1. Ridge Regression and Lasso Regression both show R² values above 0.99999 on the test set, so they explain almost all the variance in the data. Lasso's MSE (about 2.1 million) is far higher than Ridge's (about 75), which suggests Lasso is less suited to this dataset. Scores this close to perfect are also a warning sign that the features may determine the target directly, for example if total_no_of_deaths is the sum of the per-cause columns.

  2. Random Forest and Gradient Boosting have lower R² values (about 0.999) than Ridge and Lasso, and their MSE values are several orders of magnitude higher. Both were trained with default settings, so they may be overfitting or simply under-tuned for this problem.

  3. The cross-validated MSE values for Ridge and Lasso (about 2,918 and 24.1 million) are much higher than the corresponding test-set MSE values (about 75 and 2.1 million). Performance therefore varies considerably across data splits, and the single train/test split may overstate how well these models generalize.
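One way to make the cross-validated and single-split errors easier to compare is to wrap the preprocessing and the model in a single Pipeline, so the scaler and encoder are refit inside every fold, and to shuffle the folds in case the rows are ordered. The sketch below runs on a small synthetic DataFrame; the column names `a`, `b`, `region` and the helper `make_model` are illustrative assumptions, not part of this notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the mortality dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "region": rng.choice(["north", "south"], size=200),
})
demo_target = 3.0 * demo["a"] - 2.0 * demo["b"] + rng.normal(scale=0.1, size=200)

def make_model(alpha=1.0):
    """Preprocessing + Ridge in one estimator, so each CV fold refits the scaler."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["a", "b"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])
    return Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=alpha))])

# Shuffled folds avoid scoring each fold on one contiguous block of rows
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(make_model(), demo, demo_target, cv=folds,
                          scoring="neg_mean_squared_error")
print(f"Per-fold MSE: {cv_mse}")
print(f"Mean CV MSE: {cv_mse.mean():.4f}")
```

Passing the whole pipeline to `cross_val_score` (or to `GridSearchCV`) keeps the held-out fold out of the preprocessing step, which a separately fitted `preprocessor.transform(X)` does not guarantee.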

In [627]:
# hyperparameter tuning using GridSearchCV:
from sklearn.model_selection import GridSearchCV

# Define parameter grids for Random Forest and Gradient Boosting
parameters_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

parameters_gb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0]
}

# Initialize models
rf_model = RandomForestRegressor()
gb_model = GradientBoostingRegressor()

# Grid Search for Random Forest
grid_search_rf = GridSearchCV(rf_model, parameters_rf, cv=5, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train_preprocessed, y_train)
best_rf_model = grid_search_rf.best_estimator_

# Grid Search for Gradient Boosting
grid_search_gb = GridSearchCV(gb_model, parameters_gb, cv=5, scoring='neg_mean_squared_error')
grid_search_gb.fit(X_train_preprocessed, y_train)
best_gb_model = grid_search_gb.best_estimator_

# Make predictions with tuned models
y_pred_rf = best_rf_model.predict(X_test_preprocessed)
y_pred_gb = best_gb_model.predict(X_test_preprocessed)

# Evaluate models
evaluate_model(y_test, y_pred_rf, "Tuned Random Forest")
evaluate_model(y_test, y_pred_gb, "Tuned Gradient Boosting")
Tuned Random Forest Mean Squared Error: 423071262.43593276
Tuned Random Forest R-squared: 0.9992787442448273
Tuned Gradient Boosting Mean Squared Error: 351335976.6355588
Tuned Gradient Boosting R-squared: 0.9994010392157373
In [628]:
# Feature Engineering
# Feature Selection: Remove irrelevant or redundant features to improve model performance. You can use techniques like Recursive Feature Elimination (RFE) or feature importance scores.

from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Initialize the model and RFE
model = Ridge()
rfe = RFE(model, n_features_to_select=10)

# Fit RFE
X_train_rfe = rfe.fit_transform(X_train_preprocessed, y_train)
X_test_rfe = rfe.transform(X_test_preprocessed)

# Use the selected features to fit the model
model.fit(X_train_rfe, y_train)
y_pred = model.predict(X_test_rfe)
evaluate_model(y_test, y_pred, "Ridge with RFE")
Ridge with RFE Mean Squared Error: 1016822564.5699499
Ridge with RFE R-squared: 0.9982665115979211
In [629]:
# Model Validation
# Cross-Validation: Perform cross-validation to ensure that the model is not overfitting and performs well on unseen data.

from sklearn.model_selection import cross_val_score

# Cross-validation scores
cv_scores_rf = cross_val_score(best_rf_model, X_train_preprocessed, y_train, cv=5, scoring='neg_mean_squared_error')
cv_scores_gb = cross_val_score(best_gb_model, X_train_preprocessed, y_train, cv=5, scoring='neg_mean_squared_error')

print(f"Cross-Validated Mean Squared Error (Random Forest): {-cv_scores_rf.mean()}")
print(f"Cross-Validated Mean Squared Error (Gradient Boosting): {-cv_scores_gb.mean()}")
Cross-Validated Mean Squared Error (Random Forest): 491233682.1143919
Cross-Validated Mean Squared Error (Gradient Boosting): 320611482.3465633
In [631]:
#  Data Quality
# Review Data Preprocessing: Check for any issues in data preprocessing and ensure all necessary transformations are applied correctly.


# Review data preprocessing steps
print(X_train_preprocessed.shape)
print(X_test_preprocessed.shape)
(4896, 32)
(1224, 32)
In [405]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_y_pred = rf_model.predict(X_test)

# Evaluate the model
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)

print(f"Random Forest Mean Squared Error: {rf_mse}")
print(f"Random Forest R-squared: {rf_r2}")
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
In [406]:
from sklearn.model_selection import cross_val_score

# Linear Regression with cross-validation
lr_model = LinearRegression()
lr_scores = cross_val_score(lr_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Linear Regression Mean Cross-Validated MSE: {-lr_scores.mean()}")

# Random Forest with cross-validation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Random Forest Mean Cross-Validated MSE: {-rf_scores.mean()}")
Linear Regression Mean Cross-Validated MSE: 6.117124717691846e-14
Random Forest Mean Cross-Validated MSE: 71124482809.29434
In [407]:
import matplotlib.pyplot as plt
import seaborn as sns

# Train Linear Regression Model
lr_model.fit(X_train, y_train)
y_train_pred = lr_model.predict(X_train)
residuals = y_train - y_train_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_train_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()
In [408]:
import pandas as pd

# Train Random Forest Model
rf_model.fit(X_train, y_train)

# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Plot feature importances
plt.figure(figsize=(12, 8))
feature_importances.plot(kind='bar')
plt.title('Feature Importances from Random Forest')
plt.show()
In [409]:
# Linear Regression Evaluation
lr_model.fit(X_train, y_train)
y_test_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_test_pred_lr)
r2_lr = r2_score(y_test, y_test_pred_lr)
print(f"Linear Regression Mean Squared Error: {mse_lr}")
print(f"Linear Regression R-squared: {r2_lr}")

# Random Forest Evaluation
rf_y_pred = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, rf_y_pred)
r2_rf = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Linear Regression Mean Squared Error: 2.0415135847184325e-18
Linear Regression R-squared: 1.0
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
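An R-squared of exactly 1.0 together with an MSE near machine precision usually means the target can be computed directly from the features. A quick check, sketched here on synthetic data with hypothetical column names (`cause_a`, `cause_b`, `cause_c`, `total`) and an assumed helper `target_is_row_sum`, is whether the target equals the row-wise sum of the feature columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three hypothetical cause columns and their row total
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 1000, size=(50, 3)),
                   columns=["cause_a", "cause_b", "cause_c"])
toy["total"] = toy[["cause_a", "cause_b", "cause_c"]].sum(axis=1)

def target_is_row_sum(frame, target, feature_cols):
    """Return True when the target column equals the sum of the feature columns."""
    row_sum = frame[feature_cols].sum(axis=1)
    return bool(np.allclose(row_sum, frame[target]))

is_sum = target_is_row_sum(toy, "total", ["cause_a", "cause_b", "cause_c"])
print(f"Target is an exact sum of the features: {is_sum}")
```

If the same check passes on the real DataFrame, the linear model is recovering an identity rather than learning a pattern, and the regression metrics say little about predictive skill.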

Check outliers

In [410]:
def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound
In [411]:
def plot_boxplot(df, column):
    """Plot a boxplot for visualizing outliers."""
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot for {column}')
    plt.show()
In [412]:
# Check and handle outliers
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        lower_bound, upper_bound = detect_outliers_iqr(df, column)
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Outliers in column '{column}': {len(outliers)}")

        # Plot boxplot to visualize outliers
        plot_boxplot(df, column)
        
        # Handle outliers by capping values
        df.loc[df[column] < lower_bound, column] = lower_bound
        df.loc[df[column] > upper_bound, column] = upper_bound

# Verify if outliers are handled
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        lower_bound, upper_bound = detect_outliers_iqr(df, column)
        outliers_after = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Outliers in column '{column}' after handling: {len(outliers_after)}")
Outliers in column 'meningitis': 1029
Outliers in column 'alzheimers_disease_and_other_dementias': 819
Outliers in column 'parkinsons_disease': 811
Outliers in column 'nutritional_deficiencies': 950
Outliers in column 'malaria': 1278
Outliers in column 'drowning': 733
Outliers in column 'interpersonal_violence': 841
Outliers in column 'maternal_disorders': 789
Outliers in column 'hiv/aids': 1041
Outliers in column 'drug_use_disorders': 725
Outliers in column 'tuberculosis': 916
Outliers in column 'cardiovascular_diseases': 732
Outliers in column 'lower_respiratory_infections': 593
Outliers in column 'neonatal_disorders': 777
Outliers in column 'alcohol_use_disorders': 685
Outliers in column 'self-harm': 722
Outliers in column 'exposure_to_forces_of_nature': 1025
Outliers in column 'diarrheal_diseases': 926
Outliers in column 'environmental_heat_and_cold_exposure': 559
Outliers in column 'neoplasms': 768
Outliers in column 'conflict_and_terrorism': 1188
Outliers in column 'diabetes_mellitus': 872
Outliers in column 'chronic_kidney_disease': 787
Outliers in column 'poisonings': 580
Outliers in column 'protein-energy_malnutrition': 994
Outliers in column 'road_injuries': 765
Outliers in column 'chronic_respiratory_diseases': 918
Outliers in column 'cirrhosis_and_other_chronic_liver_diseases': 796
Outliers in column 'digestive_diseases': 812
Outliers in column 'fire,_heat,_and_hot_substances': 562
Outliers in column 'acute_hepatitis': 802
Outliers in column 'total_no_of_deaths': 712
Outliers in column 'cumulative_deaths': 693
Outliers in column 'meningitis' after handling: 0
Outliers in column 'alzheimers_disease_and_other_dementias' after handling: 0
Outliers in column 'parkinsons_disease' after handling: 0
Outliers in column 'nutritional_deficiencies' after handling: 0
Outliers in column 'malaria' after handling: 0
Outliers in column 'drowning' after handling: 0
Outliers in column 'interpersonal_violence' after handling: 0
Outliers in column 'maternal_disorders' after handling: 0
Outliers in column 'hiv/aids' after handling: 0
Outliers in column 'drug_use_disorders' after handling: 0
Outliers in column 'tuberculosis' after handling: 0
Outliers in column 'cardiovascular_diseases' after handling: 0
Outliers in column 'lower_respiratory_infections' after handling: 0
Outliers in column 'neonatal_disorders' after handling: 0
Outliers in column 'alcohol_use_disorders' after handling: 0
Outliers in column 'self-harm' after handling: 0
Outliers in column 'exposure_to_forces_of_nature' after handling: 0
Outliers in column 'diarrheal_diseases' after handling: 0
Outliers in column 'environmental_heat_and_cold_exposure' after handling: 0
Outliers in column 'neoplasms' after handling: 0
Outliers in column 'conflict_and_terrorism' after handling: 0
Outliers in column 'diabetes_mellitus' after handling: 0
Outliers in column 'chronic_kidney_disease' after handling: 0
Outliers in column 'poisonings' after handling: 0
Outliers in column 'protein-energy_malnutrition' after handling: 0
Outliers in column 'road_injuries' after handling: 0
Outliers in column 'chronic_respiratory_diseases' after handling: 0
Outliers in column 'cirrhosis_and_other_chronic_liver_diseases' after handling: 0
Outliers in column 'digestive_diseases' after handling: 0
Outliers in column 'fire,_heat,_and_hot_substances' after handling: 0
Outliers in column 'acute_hepatitis' after handling: 0
Outliers in column 'total_no_of_deaths' after handling: 0
Outliers in column 'cumulative_deaths' after handling: 0
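The capping step can also be written with `Series.clip`, which applies both bounds in one call. The following is a minimal sketch on a made-up series; the helper name `cap_iqr` and the parameter `k` are assumptions for illustration.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data: one extreme value that lies above the upper bound
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_iqr(values)
print(capped.tolist())
```

Capping at these bounds generally leaves Q1 and Q3 unchanged, so a repeated IQR check on the capped column finds no outliers. The zeros printed by the verification loop are therefore expected by construction and do not by themselves show that the capping was appropriate.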
In [413]:
def plot_histogram(df, column):
    """Plot a histogram for visualizing the distribution of data."""
    plt.figure(figsize=(10, 6))
    sns.histplot(df[column], kde=True, bins=30)
    plt.title(f'Histogram for {column}')
    plt.show()

# Verify the distribution after handling outliers
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        plot_histogram(df, column)
In [414]:
import pandas as pd

# Build df_cleaned by dropping every row that has an IQR outlier in any numeric column.
# Note: df was already capped in place by the loop above, so the "before" statistics
# printed below describe the capped data, not the raw data.
def handle_outliers_iqr(df):
    df_cleaned = df.copy()
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):  # Apply only to numeric columns
            Q1 = df_cleaned[column].quantile(0.25)
            Q3 = df_cleaned[column].quantile(0.75)
            IQR = Q3 - Q1
            df_cleaned = df_cleaned[(df_cleaned[column] >= (Q1 - 1.5 * IQR)) & (df_cleaned[column] <= (Q3 + 1.5 * IQR))]
    return df_cleaned

# Define df_cleaned by handling outliers
df_cleaned = handle_outliers_iqr(df)

# Descriptive Statistics Function
def descriptive_statistics(df):
    return df.describe().T

# Print descriptive statistics before handling outliers
print("Descriptive Statistics Before Handling Outliers")
print(descriptive_statistics(df))

# Print descriptive statistics after handling outliers
print("\nDescriptive Statistics After Handling Outliers")
print(descriptive_statistics(df_cleaned))
Descriptive Statistics Before Handling Outliers
                                             count       mean        std  \
year                                       6120.00    2004.50       8.66   
meningitis                                 6120.00     558.62     784.30   
alzheimers_disease_and_other_dementias     6120.00    1677.19    2105.45   
parkinsons_disease                         6120.00     412.24     513.54   
nutritional_deficiencies                   6120.00     739.49    1085.38   
malaria                                    6120.00     245.05     403.45   
drowning                                   6120.00     468.94     574.04   
interpersonal_violence                     6120.00     612.05     740.67   
maternal_disorders                         6120.00     453.71     664.04   
hiv/aids                                   6120.00    1198.91    1790.61   
drug_use_disorders                         6120.00      81.66     110.52   
tuberculosis                               6120.00    1919.56    2691.19   
cardiovascular_diseases                    6120.00   28583.10   35091.24   
lower_respiratory_infections               6120.00    6646.39    8377.32   
neonatal_disorders                         6120.00    4667.30    6560.24   
alcohol_use_disorders                      6120.00     206.37     260.79   
self-harm                                  6120.00    1256.36    1543.21   
exposure_to_forces_of_nature               6120.00       7.44      11.56   
diarrheal_diseases                         6120.00    2518.44    3707.42   
environmental_heat_and_cold_exposure       6120.00      66.75      86.90   
neoplasms                                  6120.00   13440.43   16817.80   
conflict_and_terrorism                     6120.00      14.63      22.94   
diabetes_mellitus                          6120.00    2103.89    2417.35   
chronic_kidney_disease                     6120.00    1988.02    2421.58   
poisonings                                 6120.00     161.24     207.59   
protein-energy_malnutrition                6120.00     659.48     982.64   
road_injuries                              6120.00    2380.77    2912.78   
chronic_respiratory_diseases               6120.00    3656.55    4472.84   
cirrhosis_and_other_chronic_liver_diseases 6120.00    2470.60    2958.07   
digestive_diseases                         6120.00    4303.08    5049.07   
fire,_heat,_and_hot_substances             6120.00     284.92     350.75   
acute_hepatitis                            6120.00     101.45     143.79   
total_no_of_deaths                         6120.00  107820.60  128801.21   
meningitis_change                          6120.00     -21.16     919.88   
cumulative_deaths                          6120.00 1483344.83 1879915.84   

                                                 min      25%       50%  \
year                                         1990.00  1997.00   2004.50   
meningitis                                      0.00    15.00    109.00   
alzheimers_disease_and_other_dementias          0.00    90.00    666.50   
parkinsons_disease                              0.00    27.00    164.00   
nutritional_deficiencies                        0.00     9.00    119.00   
malaria                                         0.00     0.00      0.00   
drowning                                        0.00    34.00    177.00   
interpersonal_violence                          0.00    40.00    265.00   
maternal_disorders                              0.00     5.00     54.00   
hiv/aids                                        0.00    11.00    136.00   
drug_use_disorders                              0.00     3.00     20.00   
tuberculosis                                    0.00    35.00    417.00   
cardiovascular_diseases                         4.00  2028.00  11742.00   
lower_respiratory_infections                    0.00   345.00   2126.50   
neonatal_disorders                              0.00   131.00    916.00   
alcohol_use_disorders                           0.00     9.00     80.00   
self-harm                                       0.00    94.00    533.00   
exposure_to_forces_of_nature                    0.00     0.00      0.00   
diarrheal_diseases                              0.00    20.00    296.50   
environmental_heat_and_cold_exposure            0.00     2.00     21.00   
neoplasms                                       1.00   809.75   5629.50   
conflict_and_terrorism                          0.00     0.00      0.00   
diabetes_mellitus                               1.00   236.00   1087.00   
chronic_kidney_disease                          0.00   145.75    822.00   
poisonings                                      0.00     6.00     52.50   
protein-energy_malnutrition                     0.00     5.00     92.00   
road_injuries                                   0.00   174.75    966.50   
chronic_respiratory_diseases                    1.00   289.00   1689.00   
cirrhosis_and_other_chronic_liver_diseases      0.00   154.00   1210.00   
digestive_diseases                              0.00   284.00   2185.00   
fire,_heat,_and_hot_substances                  0.00    17.00    126.00   
acute_hepatitis                                 0.00     2.00     15.00   
total_no_of_deaths                              7.00  6935.00  50257.50   
meningitis_change                          -10728.00   -11.00     -1.00   
cumulative_deaths                              13.00 71995.25 553431.50   

                                                  75%        max  
year                                          2012.00    2019.00  
meningitis                                     847.25    2095.62  
alzheimers_disease_and_other_dementias        2456.25    6005.62  
parkinsons_disease                             609.25    1482.62  
nutritional_deficiencies                      1167.25    2904.62  
malaria                                        393.00     982.50  
drowning                                       698.00    1694.00  
interpersonal_violence                         877.00    2132.50  
maternal_disorders                             734.00    1827.50  
hiv/aids                                      1879.00    4681.00  
drug_use_disorders                             129.00     318.00  
tuberculosis                                  2924.25    7258.12  
cardiovascular_diseases                      42546.50  103324.25  
lower_respiratory_infections                 10161.25   24885.62  
neonatal_disorders                            7419.75   18352.88  
alcohol_use_disorders                          316.00     776.50  
self-harm                                     1882.25    4564.62  
exposure_to_forces_of_nature                    12.00      30.00  
diarrheal_diseases                            3946.75    9836.88  
environmental_heat_and_cold_exposure           109.00     269.50  
neoplasms                                    20147.75   49154.75  
conflict_and_terrorism                          23.00      57.50  
diabetes_mellitus                             2954.00    7031.00  
chronic_kidney_disease                        2922.50    7087.62  
poisonings                                     254.00     626.00  
protein-energy_malnutrition                   1042.50    2598.75  
road_injuries                                 3435.25    8326.00  
chronic_respiratory_diseases                  5249.75   12690.88  
cirrhosis_and_other_chronic_liver_diseases    3547.25    8637.12  
digestive_diseases                            6080.00   14774.00  
fire,_heat,_and_hot_substances                 450.00    1099.50  
acute_hepatitis                                160.00     397.00  
total_no_of_deaths                          158221.00  385150.00  
meningitis_change                                0.00   53333.00  
cumulative_deaths                          2266613.50 5558540.88  

Descriptive Statistics After Handling Outliers
                                             count      mean       std  \
year                                       4070.00   2004.07      8.65   
meningitis                                 4070.00    173.58    395.37   
alzheimers_disease_and_other_dementias     4070.00    921.02   1412.46   
parkinsons_disease                         4070.00    221.39    336.87   
nutritional_deficiencies                   4070.00    246.26    612.31   
malaria                                    4070.00    104.94    275.86   
drowning                                   4070.00    203.85    341.15   
interpersonal_violence                     4070.00    313.91    517.93   
maternal_disorders                         4070.00    151.86    361.25   
hiv/aids                                   4070.00    494.25   1149.79   
drug_use_disorders                         4070.00     47.15     81.68   
tuberculosis                               4070.00    631.03   1409.98   
cardiovascular_diseases                    4070.00  14155.90  21033.66   
lower_respiratory_infections               4070.00   2273.97   4004.72   
neonatal_disorders                         4070.00   1509.61   3258.89   
alcohol_use_disorders                      4070.00    134.16    205.64   
self-harm                                  4070.00    590.95    933.31   
exposure_to_forces_of_nature               4070.00      3.82      8.42   
diarrheal_diseases                         4070.00    838.16   2051.25   
environmental_heat_and_cold_exposure       4070.00     34.06     63.09   
neoplasms                                  4070.00   6906.92  10548.30   
conflict_and_terrorism                     4070.00      7.24     17.01   
diabetes_mellitus                          4070.00   1015.66   1419.24   
chronic_kidney_disease                     4070.00    902.29   1401.46   
poisonings                                 4070.00     59.90    114.27   
protein-energy_malnutrition                4070.00    222.79    559.27   
road_injuries                              4070.00   1025.76   1734.52   
chronic_respiratory_diseases               4070.00   1639.05   2528.57   
cirrhosis_and_other_chronic_liver_diseases 4070.00   1043.32   1596.46   
digestive_diseases                         4070.00   1859.30   2719.22   
fire,_heat,_and_hot_substances             4070.00    118.69    198.72   
acute_hepatitis                            4070.00     32.77     76.06   
total_no_of_deaths                         4070.00  41609.82  61383.73   
meningitis_change                          4070.00     -2.10      5.94   
cumulative_deaths                          4070.00 484839.98 654676.59   

                                               min      25%       50%  \
year                                       1990.00  1997.00   2004.00   
meningitis                                    0.00     4.00     32.00   
alzheimers_disease_and_other_dementias        0.00    28.00    275.00   
parkinsons_disease                            0.00     9.00     71.00   
nutritional_deficiencies                      0.00     4.00     17.00   
malaria                                       0.00     0.00      0.00   
drowning                                      0.00    14.00     66.00   
interpersonal_violence                        0.00    16.00     99.50   
maternal_disorders                            0.00     2.00     11.00   
hiv/aids                                      0.00     4.00     37.00   
drug_use_disorders                            0.00     1.00      9.00   
tuberculosis                                  0.00    12.00     73.50   
cardiovascular_diseases                       4.00   640.50   5170.00   
lower_respiratory_infections                  0.00    86.00    678.00   
neonatal_disorders                            0.00    38.00    242.00   
alcohol_use_disorders                         0.00     5.00     25.00   
self-harm                                     0.00    31.00    212.00   
exposure_to_forces_of_nature                  0.00     0.00      0.00   
diarrheal_diseases                            0.00     6.25     60.00   
environmental_heat_and_cold_exposure          0.00     1.00      5.00   
neoplasms                                     1.00   295.25   2483.00   
conflict_and_terrorism                        0.00     0.00      0.00   
diabetes_mellitus                             1.00    91.25    481.50   
chronic_kidney_disease                        0.00    58.00    316.50   
poisonings                                    0.00     2.00     15.00   
protein-energy_malnutrition                   0.00     2.00     12.00   
road_injuries                                 0.00    45.00    395.50   
chronic_respiratory_diseases                  1.00    81.00    592.50   
cirrhosis_and_other_chronic_liver_diseases    0.00    42.00    340.50   
digestive_diseases                            0.00    85.25    674.50   
fire,_heat,_and_hot_substances                0.00     4.00     41.00   
acute_hepatitis                               0.00     1.00      4.00   
total_no_of_deaths                            7.00  1888.25  18539.00   
meningitis_change                           -27.00    -2.00      0.00   
cumulative_deaths                            13.00 21194.50 177374.00   

                                                 75%        max  
year                                         2011.00    2019.00  
meningitis                                    125.00    2095.62  
alzheimers_disease_and_other_dementias       1173.00    6005.62  
parkinsons_disease                            279.00    1482.62  
nutritional_deficiencies                      149.00    2904.62  
malaria                                         2.00     982.50  
drowning                                      216.75    1694.00  
interpersonal_violence                        355.75    2132.50  
maternal_disorders                             76.00    1827.50  
hiv/aids                                      235.00    4681.00  
drug_use_disorders                             49.00     318.00  
tuberculosis                                  500.75    7258.12  
cardiovascular_diseases                     19218.25  103324.25  
lower_respiratory_infections                 2303.25   24885.62  
neonatal_disorders                           1280.50   18352.88  
alcohol_use_disorders                         204.00     776.50  
self-harm                                     656.00    4564.62  
exposure_to_forces_of_nature                    2.00      30.00  
diarrheal_diseases                            426.25    9836.88  
environmental_heat_and_cold_exposure           32.00     269.50  
neoplasms                                    8220.00   49154.75  
conflict_and_terrorism                          2.00      57.50  
diabetes_mellitus                            1374.75    7031.00  
chronic_kidney_disease                       1070.75    7087.62  
poisonings                                     60.00     626.00  
protein-energy_malnutrition                   126.00    2598.75  
road_injuries                                1131.50    8326.00  
chronic_respiratory_diseases                 2077.25   12690.88  
cirrhosis_and_other_chronic_liver_diseases   1476.75    8637.12  
digestive_diseases                           2627.00   14774.00  
fire,_heat,_and_hot_substances                139.75    1099.50  
acute_hepatitis                                21.00     397.00  
total_no_of_deaths                          53836.25  385150.00  
meningitis_change                               0.00      16.00  
cumulative_deaths                          696965.25 2813393.00  
In [415]:
# Calculate summary statistics before handling outliers
stats_before = df.describe().T[['mean', '50%', 'std']]
stats_before.columns = ['mean_before', 'median_before', 'std_before']

# Calculate summary statistics after handling outliers
stats_after = df_cleaned.describe().T[['mean', '50%', 'std']]
stats_after.columns = ['mean_after', 'median_after', 'std_after']

# Merge statistics before and after
stats_comparison = pd.concat([stats_before, stats_after], axis=1)
print(stats_comparison)
                                            mean_before  median_before  \
year                                            2004.50        2004.50   
meningitis                                       558.62         109.00   
alzheimers_disease_and_other_dementias          1677.19         666.50   
parkinsons_disease                               412.24         164.00   
nutritional_deficiencies                         739.49         119.00   
malaria                                          245.05           0.00   
drowning                                         468.94         177.00   
interpersonal_violence                           612.05         265.00   
maternal_disorders                               453.71          54.00   
hiv/aids                                        1198.91         136.00   
drug_use_disorders                                81.66          20.00   
tuberculosis                                    1919.56         417.00   
cardiovascular_diseases                        28583.10       11742.00   
lower_respiratory_infections                    6646.39        2126.50   
neonatal_disorders                              4667.30         916.00   
alcohol_use_disorders                            206.37          80.00   
self-harm                                       1256.36         533.00   
exposure_to_forces_of_nature                       7.44           0.00   
diarrheal_diseases                              2518.44         296.50   
environmental_heat_and_cold_exposure              66.75          21.00   
neoplasms                                      13440.43        5629.50   
conflict_and_terrorism                            14.63           0.00   
diabetes_mellitus                               2103.89        1087.00   
chronic_kidney_disease                          1988.02         822.00   
poisonings                                       161.24          52.50   
protein-energy_malnutrition                      659.48          92.00   
road_injuries                                   2380.77         966.50   
chronic_respiratory_diseases                    3656.55        1689.00   
cirrhosis_and_other_chronic_liver_diseases      2470.60        1210.00   
digestive_diseases                              4303.08        2185.00   
fire,_heat,_and_hot_substances                   284.92         126.00   
acute_hepatitis                                  101.45          15.00   
total_no_of_deaths                            107820.60       50257.50   
meningitis_change                                -21.16          -1.00   
cumulative_deaths                            1483344.83      553431.50   

                                            std_before  mean_after  \
year                                              8.66     2004.07   
meningitis                                      784.30      173.58   
alzheimers_disease_and_other_dementias         2105.45      921.02   
parkinsons_disease                              513.54      221.39   
nutritional_deficiencies                       1085.38      246.26   
malaria                                         403.45      104.94   
drowning                                        574.04      203.85   
interpersonal_violence                          740.67      313.91   
maternal_disorders                              664.04      151.86   
hiv/aids                                       1790.61      494.25   
drug_use_disorders                              110.52       47.15   
tuberculosis                                   2691.19      631.03   
cardiovascular_diseases                       35091.24    14155.90   
lower_respiratory_infections                   8377.32     2273.97   
neonatal_disorders                             6560.24     1509.61   
alcohol_use_disorders                           260.79      134.16   
self-harm                                      1543.21      590.95   
exposure_to_forces_of_nature                     11.56        3.82   
diarrheal_diseases                             3707.42      838.16   
environmental_heat_and_cold_exposure             86.90       34.06   
neoplasms                                     16817.80     6906.92   
conflict_and_terrorism                           22.94        7.24   
diabetes_mellitus                              2417.35     1015.66   
chronic_kidney_disease                         2421.58      902.29   
poisonings                                      207.59       59.90   
protein-energy_malnutrition                     982.64      222.79   
road_injuries                                  2912.78     1025.76   
chronic_respiratory_diseases                   4472.84     1639.05   
cirrhosis_and_other_chronic_liver_diseases     2958.07     1043.32   
digestive_diseases                             5049.07     1859.30   
fire,_heat,_and_hot_substances                  350.75      118.69   
acute_hepatitis                                 143.79       32.77   
total_no_of_deaths                           128801.21    41609.82   
meningitis_change                               919.88       -2.10   
cumulative_deaths                           1879915.84   484839.98   

                                            median_after  std_after  
year                                             2004.00       8.65  
meningitis                                         32.00     395.37  
alzheimers_disease_and_other_dementias            275.00    1412.46  
parkinsons_disease                                 71.00     336.87  
nutritional_deficiencies                           17.00     612.31  
malaria                                             0.00     275.86  
drowning                                           66.00     341.15  
interpersonal_violence                             99.50     517.93  
maternal_disorders                                 11.00     361.25  
hiv/aids                                           37.00    1149.79  
drug_use_disorders                                  9.00      81.68  
tuberculosis                                       73.50    1409.98  
cardiovascular_diseases                          5170.00   21033.66  
lower_respiratory_infections                      678.00    4004.72  
neonatal_disorders                                242.00    3258.89  
alcohol_use_disorders                              25.00     205.64  
self-harm                                         212.00     933.31  
exposure_to_forces_of_nature                        0.00       8.42  
diarrheal_diseases                                 60.00    2051.25  
environmental_heat_and_cold_exposure                5.00      63.09  
neoplasms                                        2483.00   10548.30  
conflict_and_terrorism                              0.00      17.01  
diabetes_mellitus                                 481.50    1419.24  
chronic_kidney_disease                            316.50    1401.46  
poisonings                                         15.00     114.27  
protein-energy_malnutrition                        12.00     559.27  
road_injuries                                     395.50    1734.52  
chronic_respiratory_diseases                      592.50    2528.57  
cirrhosis_and_other_chronic_liver_diseases        340.50    1596.46  
digestive_diseases                                674.50    2719.22  
fire,_heat,_and_hot_substances                     41.00     198.72  
acute_hepatitis                                     4.00      76.06  
total_no_of_deaths                              18539.00   61383.73  
meningitis_change                                   0.00       5.94  
cumulative_deaths                              177374.00  654676.59  
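
The comparison above is easier to read as a relative change. The sketch below copies three rows of means from the printed table into a small, hypothetical `stats_demo` frame and computes the percentage change; it does not touch `df` or `df_cleaned`.

```python
import pandas as pd

# A few rows copied by hand from the comparison printed above
stats_demo = pd.DataFrame(
    {
        "mean_before": [558.62, 28583.10, 107820.60],
        "mean_after": [173.58, 14155.90, 41609.82],
    },
    index=["meningitis", "cardiovascular_diseases", "total_no_of_deaths"],
)

# Percentage change of the mean after outlier handling
stats_demo["mean_pct_change"] = (
    (stats_demo["mean_after"] - stats_demo["mean_before"])
    / stats_demo["mean_before"]
    * 100
)
print(stats_demo.round(1))
```

The same expression could be applied to the full `stats_comparison` frame, since it already holds `mean_before` and `mean_after` columns.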

Modeling

In [416]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

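# Define features and target.
# Caution: X still includes cumulative_deaths, which the data previews show to be
# a running sum of the target, so the scores below may be optimistic.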
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
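
Fitting the scaler on the training split only, as above, keeps test-set statistics out of preprocessing. One way to make that harder to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the mortality dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's df)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling parameters from X_tr only, then fits the regressor;
# score() reuses those parameters on X_te
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2_demo = pipe.score(X_te, y_te)
print(f"Demo R-squared: {r2_demo:.4f}")
```

A pipeline like this can also be passed directly to `cross_val_score`, so the scaler is re-fitted inside each fold.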

Linear Regression to predict total_no_of_deaths from the other columns:

In [417]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 494300444.4112894
R-squared: 0.9708188257223398

Random Forest:

In [418]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_y_pred = rf_model.predict(X_test)

# Evaluate the model
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)

print(f"Random Forest Mean Squared Error: {rf_mse}")
print(f"Random Forest R-squared: {rf_r2}")
Random Forest Mean Squared Error: 131796235.34221615
Random Forest R-squared: 0.9922193699072206
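
MSE is expressed in squared deaths, which makes the values above hard to interpret. Taking the square root gives RMSE in the same unit as `total_no_of_deaths`. The snippet below only re-expresses the two MSE values printed above; nothing is re-fitted.

```python
import math

# MSE values copied from the outputs printed above
mse_linear = 494300444.4112894
mse_forest = 131796235.34221615

# RMSE is the square root of MSE, in the unit of the target
rmse_linear = math.sqrt(mse_linear)
rmse_forest = math.sqrt(mse_forest)

print(f"Linear Regression RMSE: {rmse_linear:,.0f}")
print(f"Random Forest RMSE: {rmse_forest:,.0f}")
```

For scale, the earlier summary statistics report a standard deviation of 128801.21 for `total_no_of_deaths`.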

Cross-Validation

Use cross-validation to assess model performance more reliably:

In [419]:
from sklearn.model_selection import cross_val_score

# Linear Regression with cross-validation
lr_model = LinearRegression()
lr_scores = cross_val_score(lr_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Linear Regression Mean Cross-Validated MSE: {-lr_scores.mean()}")

# Random Forest with cross-validation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Random Forest Mean Cross-Validated MSE: {-rf_scores.mean()}")
Linear Regression Mean Cross-Validated MSE: 679392109.7019784
Random Forest Mean Cross-Validated MSE: 637366769.3314102
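
Two details may help when reading these numbers. With `cv=5` and a regressor, scikit-learn uses `KFold` without shuffling, so if the rows are ordered (for example by country and year) the folds can differ systematically from the random hold-out split used earlier. The `neg_mean_squared_error` scorer also returns negated values, which is why the code flips the sign. The sketch below shows a shuffled `KFold` with a scaling pipeline on synthetic stand-in data (`X_demo`, `y_demo` are assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's df)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, 1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

model_demo = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns negated MSE so that larger values mean better fits
neg_mse = cross_val_score(
    model_demo, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("Mean RMSE:", rmse_per_fold.mean())
```

Looking at the per-fold values rather than only the mean can show whether one fold dominates the average.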

Model Diagnostics

Examine the Linear Regression residuals to check whether they are randomly dispersed around zero:

In [420]:
import matplotlib.pyplot as plt
import seaborn as sns

# Train Linear Regression Model
lr_model.fit(X_train, y_train)
y_train_pred = lr_model.predict(X_train)
residuals = y_train - y_train_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_train_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()
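
The plot above uses training residuals. For a least-squares fit with an intercept, those average to zero by construction, so residuals on held-out data can be a useful complement. A minimal numeric sketch on synthetic stand-in data (all names here are assumptions, not the notebook's variables):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's df)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(250, 2))
y_demo = 4.0 + X_demo @ np.array([1.5, -0.7]) + rng.normal(scale=1.0, size=250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
lin = LinearRegression().fit(X_tr, y_tr)

train_resid = y_tr - lin.predict(X_tr)
test_resid = y_te - lin.predict(X_te)

# Training residuals of a least-squares fit with an intercept average to ~0
print(f"Mean training residual: {train_resid.mean():.2e}")
print(f"Mean test residual: {test_resid.mean():.3f}")
print(f"Std of test residuals: {test_resid.std():.3f}")
```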

Feature Importance (Random Forest)

Evaluate feature importance to understand which features impact the model predictions:

In [421]:
import pandas as pd

# Train Random Forest Model
rf_model.fit(X_train, y_train)

# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Plot feature importances
plt.figure(figsize=(12, 8))
feature_importances.plot(kind='bar')
plt.title('Feature Importances from Random Forest')
plt.show()
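
`feature_importances_` is computed from impurity reductions on the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one column is shuffled. The sketch below uses a synthetic frame with hypothetical column names (`strong`, `weak`, `noise`), not the mortality features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one dominant feature (assumption)
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and record the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm_importances = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importances)
```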

Evaluate on Test Data

In [422]:
# Linear Regression Evaluation
lr_model.fit(X_train, y_train)
y_test_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_test_pred_lr)
r2_lr = r2_score(y_test, y_test_pred_lr)
print(f"Linear Regression Mean Squared Error: {mse_lr}")
print(f"Linear Regression R-squared: {r2_lr}")

# Random Forest Evaluation
rf_y_pred = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, rf_y_pred)
r2_rf = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Linear Regression Mean Squared Error: 494300444.4112894
Linear Regression R-squared: 0.9708188257223398
Random Forest Mean Squared Error: 131796235.34221615
Random Forest R-squared: 0.9922193699072206

Check for Outliers

In [632]:
# Select only numeric columns for analysis
numeric_df = df.select_dtypes(include=[np.number])

# Calculate Q1 (25th percentile) and Q3 (75th percentile) for numeric columns
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Detect outliers for each column (values outside of 1.5 * IQR)
outliers = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))

# Count the number of outliers for each column
outliers_count = outliers.sum()
print(outliers_count)
year                                             0
meningitis                                    1029
alzheimers_disease_and_other_dementias         819
parkinsons_disease                             811
nutritional_deficiencies                       950
malaria                                       1278
drowning                                       733
interpersonal_violence                         841
maternal_disorders                             789
hiv/aids                                      1041
drug_use_disorders                             725
tuberculosis                                   916
cardiovascular_diseases                        732
lower_respiratory_infections                   593
neonatal_disorders                             777
alcohol_use_disorders                          685
self-harm                                      722
exposure_to_forces_of_nature                  1025
diarrheal_diseases                             926
environmental_heat_and_cold_exposure           559
neoplasms                                      768
conflict_and_terrorism                        1188
diabetes_mellitus                              872
chronic_kidney_disease                         787
poisonings                                     580
protein-energy_malnutrition                    994
road_injuries                                  765
chronic_respiratory_diseases                   918
cirrhosis_and_other_chronic_liver_diseases     796
digestive_diseases                             812
fire,_heat,_and_hot_substances                 562
acute_hepatitis                                802
total_no_of_deaths                             712
meningitis_change                             1508
cumulative_deaths                              693
dtype: int64
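
To make the 1.5 * IQR rule used above concrete, here is the same calculation on a five-value toy series (the numbers are made up for illustration):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])  # toy data, not from the dataset

q1 = values.quantile(0.25)     # 2.0
q3 = values.quantile(0.75)     # 4.0
iqr = q3 - q1                  # 2.0
lower_fence = q1 - 1.5 * iqr   # -1.0
upper_fence = q3 + 1.5 * iqr   # 7.0

# A value is flagged when it falls outside the two fences
is_outlier = (values < lower_fence) | (values > upper_fence)
print(values[is_outlier].tolist())  # [100]
```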
In [634]:
# List of columns to plot
columns_to_plot = ['malaria', 'conflict_and_terrorism', 'diabetes_mellitus', 'cardiovascular_diseases']

# Plotting the boxplots for selected columns
plt.figure(figsize=(14, 8))

for i, col in enumerate(columns_to_plot, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=numeric_df[col])
    plt.title(f'Boxplot for {col}')

plt.tight_layout()
plt.show()
In [635]:
# Cap/floor outliers at the 5th and 95th percentile
lower_bound = numeric_df.quantile(0.05)
upper_bound = numeric_df.quantile(0.95)

# Applying the capping
capped_df = numeric_df.clip(lower=lower_bound, upper=upper_bound, axis=1)

# Check the capped dataframe
capped_df.head()
Out[635]:
   year  meningitis  alzheimers_disease_and_other_dementias  \
0  1991     2159.00                                 1116.00   
1  1991     2218.00                                 1136.00   
2  1992     2475.00                                 1162.00   
3  1993     2812.00                                 1187.00   
4  1994     3027.00                                 1211.00   

   parkinsons_disease  nutritional_deficiencies  malaria  drowning  \
0              371.00                   2087.00    93.00   1370.00   
1              374.00                   2153.00   189.00   1391.00   
2              378.00                   2441.00   239.00   1514.00   
3              384.00                   2837.00   108.00   1687.00   
4              391.00                   3081.00   211.00   1809.00   

   interpersonal_violence  maternal_disorders  hiv/aids  ...  \
0                 1538.00             2655.00     34.00  ...   
1                 2001.00             2885.00     41.00  ...   
2                 2299.00             3315.00     48.00  ...   
3                 2589.00             3636.15     56.00  ...   
4                 2849.00             3636.15     63.00  ...   

   protein-energy_malnutrition  road_injuries  chronic_respiratory_diseases  \
0                      2054.00        4154.00                       5945.00   
1                      2119.00        4472.00                       6050.00   
2                      2404.00        5106.00                       6223.00   
3                      2797.00        5681.00                       6445.00   
4                      3038.00        6001.00                       6664.00   

   cirrhosis_and_other_chronic_liver_diseases  digestive_diseases  \
0                                     2673.00             5005.00   
1                                     2728.00             5120.00   
2                                     2830.00             5335.00   
3                                     2943.00             5568.00   
4                                     3027.00             5739.00   

   fire,_heat,_and_hot_substances  acute_hepatitis  total_no_of_deaths  \
0                          323.00          1569.05           147971.00   
1                          332.00          1569.05           156844.00   
2                          360.00          1569.05           169156.00   
3                          396.00          1569.05           182230.00   
4                          420.00          1569.05           194795.00   

   meningitis_change  cumulative_deaths  
0               0.00          147971.00  
1              59.00          304815.00  
2              65.00          473971.00  
3              65.00          656201.00  
4              65.00          850996.00  

5 rows × 35 columns

In [636]:
# Recalculate Q1 (25th percentile) and Q3 (75th percentile) for capped data
Q1_capped = capped_df.quantile(0.25)
Q3_capped = capped_df.quantile(0.75)

# Calculate the IQR for the capped data
IQR_capped = Q3_capped - Q1_capped

# Detect outliers for each column in the capped data (values outside of 1.5 * IQR)
outliers_capped = (capped_df < (Q1_capped - 1.5 * IQR_capped)) | (capped_df > (Q3_capped + 1.5 * IQR_capped))

# Count the number of outliers for each column in the capped data
outliers_capped_count = outliers_capped.sum()
print(outliers_capped_count)
year                                             0
meningitis                                    1029
alzheimers_disease_and_other_dementias         819
parkinsons_disease                             811
nutritional_deficiencies                       950
malaria                                       1278
drowning                                       733
interpersonal_violence                         841
maternal_disorders                             789
hiv/aids                                      1041
drug_use_disorders                             725
tuberculosis                                   916
cardiovascular_diseases                        732
lower_respiratory_infections                   593
neonatal_disorders                             777
alcohol_use_disorders                          685
self-harm                                      722
exposure_to_forces_of_nature                  1025
diarrheal_diseases                             926
environmental_heat_and_cold_exposure           559
neoplasms                                      768
conflict_and_terrorism                        1188
diabetes_mellitus                              872
chronic_kidney_disease                         787
poisonings                                     580
protein-energy_malnutrition                    994
road_injuries                                  765
chronic_respiratory_diseases                   918
cirrhosis_and_other_chronic_liver_diseases     796
digestive_diseases                             812
fire,_heat,_and_hot_substances                 562
acute_hepatitis                                802
total_no_of_deaths                             712
meningitis_change                             1508
cumulative_deaths                              693
dtype: int64
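
The counts above are identical to those before capping. One plausible explanation: clipping at the 5th and 95th percentiles cannot move Q1 or Q3, so the IQR fences stay where they were, and every column except `year` has well over 5% of its rows flagged (the smallest count, 559, is about 9% of the 6120 rows reported by `describe()`). In that situation the 95th percentile itself can lie beyond the fence, so the clipped values are still flagged. The toy series below is made up to reproduce that effect.

```python
import pandas as pd

# Toy column: 90 ordinary values plus 10 large ones (10% of the rows)
s = pd.Series(list(range(1, 91)) + [1000] * 10)

def count_iqr_outliers(col: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return int(((col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)).sum())

# Percentile capping as in the cell above, applied to the toy column
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

before = count_iqr_outliers(s)
after = count_iqr_outliers(capped)
print(before, after)
```

Because the 95th percentile of the toy column is itself one of the large values, capping leaves all ten of them above the upper fence.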
In [637]:
# Recalculate the 1st and 99th percentiles for more aggressive capping
lower_bound_strict = numeric_df.quantile(0.01)
upper_bound_strict = numeric_df.quantile(0.99)

# Apply stricter capping to the dataframe
capped_df_strict = numeric_df.clip(lower=lower_bound_strict, upper=upper_bound_strict, axis=1)

# Check the capped dataframe
print(capped_df_strict.head())
   year  meningitis  alzheimers_disease_and_other_dementias  \
0  1990     2159.00                                 1116.00   
1  1991     2218.00                                 1136.00   
2  1992     2475.00                                 1162.00   
3  1993     2812.00                                 1187.00   
4  1994     3027.00                                 1211.00   

   parkinsons_disease  nutritional_deficiencies  malaria  drowning  \
0              371.00                   2087.00    93.00   1370.00   
1              374.00                   2153.00   189.00   1391.00   
2              378.00                   2441.00   239.00   1514.00   
3              384.00                   2837.00   108.00   1687.00   
4              391.00                   3081.00   211.00   1809.00   

   interpersonal_violence  maternal_disorders  hiv/aids  ...  \
0                 1538.00             2655.00     34.00  ...   
1                 2001.00             2885.00     41.00  ...   
2                 2299.00             3315.00     48.00  ...   
3                 2589.00             3671.00     56.00  ...   
4                 2849.00             3863.00     63.00  ...   

   protein-energy_malnutrition  road_injuries  chronic_respiratory_diseases  \
0                      2054.00        4154.00                       5945.00   
1                      2119.00        4472.00                       6050.00   
2                      2404.00        5106.00                       6223.00   
3                      2797.00        5681.00                       6445.00   
4                      3038.00        6001.00                       6664.00   

   cirrhosis_and_other_chronic_liver_diseases  digestive_diseases  \
0                                     2673.00             5005.00   
1                                     2728.00             5120.00   
2                                     2830.00             5335.00   
3                                     2943.00             5568.00   
4                                     3027.00             5739.00   

   fire,_heat,_and_hot_substances  acute_hepatitis  total_no_of_deaths  \
0                          323.00          2985.00           147971.00   
1                          332.00          3092.00           156844.00   
2                          360.00          3325.00           169156.00   
3                          396.00          3601.00           182230.00   
4                          420.00          3816.00           194795.00   

   meningitis_change  cumulative_deaths  
0               0.00          147971.00  
1              59.00          304815.00  
2             257.00          473971.00  
3             337.00          656201.00  
4             215.00          850996.00  

[5 rows x 35 columns]
In [638]:
# Recalculate Q1 and Q3 for capped data (after stricter capping)
Q1_strict = capped_df_strict.quantile(0.25)
Q3_strict = capped_df_strict.quantile(0.75)

# Calculate the IQR for the capped data
IQR_strict = Q3_strict - Q1_strict

# Detect outliers for each column (values outside of 1.5 * IQR)
outliers_strict = (capped_df_strict < (Q1_strict - 1.5 * IQR_strict)) | (capped_df_strict > (Q3_strict + 1.5 * IQR_strict))

# Count the number of outliers for each column in the strictly capped data
outliers_strict_count = outliers_strict.sum()
print(outliers_strict_count)
year                                             0
meningitis                                    1029
alzheimers_disease_and_other_dementias         819
parkinsons_disease                             811
nutritional_deficiencies                       950
malaria                                       1278
drowning                                       733
interpersonal_violence                         841
maternal_disorders                             789
hiv/aids                                      1041
drug_use_disorders                             725
tuberculosis                                   916
cardiovascular_diseases                        732
lower_respiratory_infections                   593
neonatal_disorders                             777
alcohol_use_disorders                          685
self-harm                                      722
exposure_to_forces_of_nature                  1025
diarrheal_diseases                             926
environmental_heat_and_cold_exposure           559
neoplasms                                      768
conflict_and_terrorism                        1188
diabetes_mellitus                              872
chronic_kidney_disease                         787
poisonings                                     580
protein-energy_malnutrition                    994
road_injuries                                  765
chronic_respiratory_diseases                   918
cirrhosis_and_other_chronic_liver_diseases     796
digestive_diseases                             812
fire,_heat,_and_hot_substances                 562
acute_hepatitis                                802
total_no_of_deaths                             712
meningitis_change                             1508
cumulative_deaths                              693
dtype: int64
In [639]:
# Apply log(1 + x) only to columns whose values are all strictly positive;
# columns containing zeros or negative values are left unchanged
log_transformed_df = capped_df_strict.apply(lambda x: np.log1p(x) if (x > 0).all() else x)

# Recalculate Q1 and Q3 after log transformation
Q1_log = log_transformed_df.quantile(0.25)
Q3_log = log_transformed_df.quantile(0.75)

# Calculate IQR after log transformation
IQR_log = Q3_log - Q1_log

# Detect outliers after log transformation
outliers_log = (log_transformed_df < (Q1_log - 1.5 * IQR_log)) | (log_transformed_df > (Q3_log + 1.5 * IQR_log))

# Count the number of outliers after log transformation
outliers_log_count = outliers_log.sum()
print(outliers_log_count)
year                                             0
meningitis                                    1029
alzheimers_disease_and_other_dementias         819
parkinsons_disease                             811
nutritional_deficiencies                       950
malaria                                       1278
drowning                                       733
interpersonal_violence                         841
maternal_disorders                             789
hiv/aids                                      1041
drug_use_disorders                             725
tuberculosis                                   916
cardiovascular_diseases                          0
lower_respiratory_infections                     0
neonatal_disorders                             777
alcohol_use_disorders                          685
self-harm                                        0
exposure_to_forces_of_nature                  1025
diarrheal_diseases                             926
environmental_heat_and_cold_exposure           559
neoplasms                                        0
conflict_and_terrorism                        1188
diabetes_mellitus                              124
chronic_kidney_disease                           0
poisonings                                     580
protein-energy_malnutrition                    994
road_injuries                                    0
chronic_respiratory_diseases                     0
cirrhosis_and_other_chronic_liver_diseases       0
digestive_diseases                               0
fire,_heat,_and_hot_substances                 562
acute_hepatitis                                802
total_no_of_deaths                              62
meningitis_change                             1508
cumulative_deaths                               72
dtype: int64
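
Only columns whose values are all strictly positive were transformed, so any column containing a zero was left as it was, which may be why most counts above did not change. Since `log1p(0)` is 0, non-negative columns can be transformed as well. A sketch on a made-up frame (`toy` and its column names are assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "has_zeros": [0.0, 0.0, 1.0, 10.0, 1000.0],
    "all_positive": [1.0, 5.0, 20.0, 300.0, 5000.0],
    "has_negatives": [-5.0, -1.0, 0.0, 2.0, 40.0],
})

# log1p(x) = log(1 + x) is defined for x > -1, so zeros are fine;
# this sketch transforms every column that has no negative values
nonneg_cols = [c for c in toy.columns if (toy[c] >= 0).all()]
log_toy = toy.copy()
log_toy[nonneg_cols] = np.log1p(toy[nonneg_cols])

print(nonneg_cols)
print(log_toy.round(3))
```

Whether a column that can go negative, such as `meningitis_change`, should be transformed at all is a separate modelling decision that this sketch does not settle.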
In [640]:
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize outliers using boxplots
plt.figure(figsize=(15, 10))
sns.boxplot(data=df[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease',
                     'nutritional_deficiencies', 'malaria', 'drowning', 'interpersonal_violence']])
plt.xticks(rotation=90)
plt.show()
In [641]:
import numpy as np

# Identify all numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Visualize outliers for each numeric column using boxplot
for col in numeric_cols:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

# Cap outliers at the IQR fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) for all numeric columns.
# Note: this modifies df in place.
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
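
Because the loop above overwrites `df`, any later "before" statistics taken from `df` describe already-capped data. One option is a helper that returns a capped copy and leaves its input alone. `cap_iqr` below is a hypothetical name, sketched on a toy frame rather than the notebook's data.

```python
import numpy as np
import pandas as pd

def cap_iqr(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a copy with numeric columns clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = frame.copy()
    num_cols = out.select_dtypes(include=[np.number]).columns
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the column labels
    out[num_cols] = out[num_cols].clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    return out

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0], "label": list("vwxyz")})
toy_capped = cap_iqr(toy)

# The original frame is untouched; only the returned copy is capped
print(toy["a"].max(), toy_capped["a"].max())
```

With this pattern, statistics computed from the input frame and from the returned frame give a genuine before-and-after comparison.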
In [642]:
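# Compute summary statistics before this capping pass.
# Note: df was already capped in place by the previous cell, so these
# "before" statistics describe once-capped data, not the raw data.
stats_before = df.describe()

# Keep a copy of the dataframe as it stands before this pass
df_before = df.copy()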

# Handle outliers by capping using IQR
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

# Compute summary statistics after handling outliers
stats_after = df.describe()

# Compare before and after statistics
stats_comparison = pd.concat([stats_before, stats_after], axis=1, keys=['Before', 'After'])
print(stats_comparison)
       Before                                                    \
         year meningitis alzheimers_disease_and_other_dementias   
count 6120.00    6120.00                                6120.00   
mean  2004.50     558.62                                1677.19   
std      8.66     784.30                                2105.45   
min   1990.00       0.00                                   0.00   
25%   1997.00      15.00                                  90.00   
50%   2004.50     109.00                                 666.50   
75%   2012.00     847.25                                2456.25   
max   2019.00    2095.62                                6005.62   

                                                                    \
      parkinsons_disease nutritional_deficiencies malaria drowning   
count            6120.00                  6120.00 6120.00  6120.00   
mean              412.24                   739.49  245.05   468.94   
std               513.54                  1085.38  403.45   574.04   
min                 0.00                     0.00    0.00     0.00   
25%                27.00                     9.00    0.00    34.00   
50%               164.00                   119.00    0.00   177.00   
75%               609.25                  1167.25  393.00   698.00   
max              1482.62                  2904.62  982.50  1694.00   

                                                          ...  \
      interpersonal_violence maternal_disorders hiv/aids  ...   
count                6120.00            6120.00  6120.00  ...   
mean                  612.05             453.71  1198.91  ...   
std                   740.67             664.04  1790.61  ...   
min                     0.00               0.00     0.00  ...   
25%                    40.00               5.00    11.00  ...   
50%                   265.00              54.00   136.00  ...   
75%                   877.00             734.00  1879.00  ...   
max                  2132.50            1827.50  4681.00  ...   

                            After                                             \
      protein-energy_malnutrition road_injuries chronic_respiratory_diseases   
count                     6120.00       6120.00                      6120.00   
mean                       659.48       2380.77                      3656.55   
std                        982.64       2912.78                      4472.84   
min                          0.00          0.00                         1.00   
25%                          5.00        174.75                       289.00   
50%                         92.00        966.50                      1689.00   
75%                       1042.50       3435.25                      5249.75   
max                       2598.75       8326.00                     12690.88   

                                                                     \
      cirrhosis_and_other_chronic_liver_diseases digestive_diseases   
count                                    6120.00            6120.00   
mean                                     2470.60            4303.08   
std                                      2958.07            5049.07   
min                                         0.00               0.00   
25%                                       154.00             284.00   
50%                                      1210.00            2185.00   
75%                                      3547.25            6080.00   
max                                      8637.12           14774.00   

                                                                         \
      fire,_heat,_and_hot_substances acute_hepatitis total_no_of_deaths   
count                        6120.00         6120.00            6120.00   
mean                          284.92          101.45          107820.60   
std                           350.75          143.79          128801.21   
min                             0.00            0.00               7.00   
25%                            17.00            2.00            6935.00   
50%                           126.00           15.00           50257.50   
75%                           450.00          160.00          158221.00   
max                          1099.50          397.00          385150.00   

                                           
      meningitis_change cumulative_deaths  
count           6120.00           6120.00  
mean              -5.15        1483344.83  
std               12.56        1879915.84  
min              -27.50             13.00  
25%              -11.00          71995.25  
50%               -1.00         553431.50  
75%                0.00        2266613.50  
max               16.50        5558540.88  

[8 rows x 70 columns]

Robust Scaling

Use RobustScaler to scale features in a way that's robust to outliers.

In [643]:
from sklearn.preprocessing import RobustScaler

# Initialize RobustScaler
scaler = RobustScaler()

# Fit and transform the numeric columns
numeric_df_scaled = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)

# Check the results
print(numeric_df_scaled.describe())
         year  meningitis  alzheimers_disease_and_other_dementias  \
count 6120.00     6120.00                                 6120.00   
mean     0.00        1.94                                    1.77   
std      0.58        8.02                                    7.70   
min     -0.97       -0.13                                   -0.28   
25%     -0.50       -0.11                                   -0.24   
50%      0.00        0.00                                    0.00   
75%      0.50        0.89                                    0.76   
max      0.97      118.05                                  135.26   

       parkinsons_disease  nutritional_deficiencies  malaria  drowning  \
count             6120.00                   6120.00  6120.00   6120.00   
mean                 1.73                      1.84    10.54      2.27   
std                  7.93                      9.05    46.89     13.37   
min                 -0.28                     -0.10     0.00     -0.27   
25%                 -0.24                     -0.09     0.00     -0.22   
50%                  0.00                      0.00     0.00      0.00   
75%                  0.76                      0.91     1.00      0.78   
max                131.95                    231.47   714.01    231.32   

       interpersonal_violence  maternal_disorders  hiv/aids  ...  \
count                 6120.00             6120.00   6120.00  ...   
mean                     2.17                1.66      3.11  ...   
std                      8.26                8.31     11.25  ...   
min                     -0.32               -0.07     -0.07  ...   
25%                     -0.27               -0.07     -0.07  ...   
50%                      0.00                0.00      0.00  ...   
75%                      0.73                0.93      0.93  ...   
max                     82.89              147.98    163.47  ...   

       protein-energy_malnutrition  road_injuries  \
count                      6120.00        6120.00   
mean                          1.81           1.52   
std                           7.96           7.39   
min                          -0.09          -0.30   
25%                          -0.08          -0.24   
50%                           0.00           0.00   
75%                           0.92           0.76   
max                         194.84         100.68   

       chronic_respiratory_diseases  \
count                       6120.00   
mean                           3.11   
std                           21.20   
min                           -0.34   
25%                           -0.28   
50%                            0.00   
75%                            0.72   
max                          275.03   

       cirrhosis_and_other_chronic_liver_diseases  digestive_diseases  \
count                                     6120.00             6120.00   
mean                                         1.45                1.47   
std                                          6.10                6.42   
min                                         -0.36               -0.38   
25%                                         -0.31               -0.33   
50%                                          0.00                0.00   
75%                                          0.69                0.67   
max                                         79.22               79.84   

       fire,_heat,_and_hot_substances  acute_hepatitis  total_no_of_deaths  \
count                         6120.00          6120.00             6120.00   
mean                             1.07             3.82                1.25   
std                              4.92            26.49                5.78   
min                             -0.29            -0.09               -0.33   
25%                             -0.25            -0.08               -0.29   
50%                              0.00             0.00                0.00   
75%                              0.75             0.92                0.71   
max                             59.47           406.90               68.69   

       meningitis_change  cumulative_deaths  
count            6120.00            6120.00  
mean               -1.83               1.42  
std                83.63               7.05  
min              -975.18              -0.25  
25%                -0.91              -0.22  
50%                 0.00               0.00  
75%                 0.09               0.78  
max              4848.55             120.68  

[8 rows x 35 columns]
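
To make the scaled values easier to read, the sketch below reproduces RobustScaler by hand on a small made-up column (`toy` and its values are illustrative, not taken from the dataset). With default settings the scaler subtracts the median and divides by the interquartile range, so outliers keep a large magnitude after scaling; they simply stop distorting the center and spread used for the scaling. That is consistent with the large `max` values in the output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Illustrative values only: one extreme outlier among small counts
toy = pd.DataFrame({"deaths": [0.0, 15.0, 109.0, 847.0, 60000.0]})

# Manual version: center on the median, scale by the interquartile range
median = toy["deaths"].median()
iqr = toy["deaths"].quantile(0.75) - toy["deaths"].quantile(0.25)
manual = ((toy["deaths"] - median) / iqr).to_numpy()

# RobustScaler with default settings (quantile_range=(25.0, 75.0))
scaled = RobustScaler().fit_transform(toy)[:, 0]

print(np.round(manual, 2))
print(np.round(scaled, 2))
```

The median row maps to 0 and the outlier stays far from the rest, which is the same pattern as in the `describe()` output.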

Explore Different Thresholds

Apply capping at different percentiles (e.g., 95th percentile) to see if it’s a better fit.

In [644]:
# Capping at the 95th percentile
upper_cap_95 = numeric_df.quantile(0.95)
numeric_df_capped_95 = numeric_df.clip(upper=upper_cap_95, axis=1)

# Visualizing the capped data at the 95th percentile
plt.figure(figsize=(15, 10))
sns.boxplot(data=numeric_df_capped_95[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease']])
plt.title("After 95th Percentile Capping")
plt.xticks(rotation=90)
plt.show()

# Summary statistics after 95th percentile capping
print("\nAfter 95th Percentile Capping:")
print(numeric_df_capped_95[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease']].describe())
After 95th Percentile Capping:
       meningitis  alzheimers_disease_and_other_dementias  parkinsons_disease
count     6120.00                                 6120.00             6120.00
mean       918.75                                 2791.50              665.43
std       1654.04                                 5128.79             1192.28
min          0.00                                    0.00                0.00
25%         15.00                                   90.00               27.00
50%        109.00                                  666.50              164.00
75%        847.25                                 2456.25              609.25
max       6110.10                                20386.30             4707.15
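
One way to compare thresholds is to cap the same columns at several percentiles and see how the resulting maximum moves. The helper `cap_at_percentile` and the synthetic `demo` frame below are assumptions for illustration; in the notebook, `numeric_df` would take the place of `demo`.

```python
import numpy as np
import pandas as pd

def cap_at_percentile(frame, q):
    """Clip each column from above at its own q-th quantile (0 < q <= 1)."""
    return frame.clip(upper=frame.quantile(q), axis=1)

# Synthetic right-skewed data standing in for numeric_df
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "cause_a": rng.lognormal(mean=5.0, sigma=1.5, size=1000),
    "cause_b": rng.lognormal(mean=6.0, sigma=1.0, size=1000),
})

rows = {}
for p in (90, 95, 99):
    capped = cap_at_percentile(demo, p / 100)
    rows[f"p{p}"] = capped.max()  # the cap that was applied to each column

# One row per threshold, one column per cause
summary = pd.DataFrame(rows).T
print(summary.round(2))
```

A lower percentile trims more of the tail, so the choice is a trade-off between limiting outliers and discarding genuine extreme values.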

Visualize More Columns

Visualize additional columns to ensure comprehensive outlier handling.

In [645]:
# Visualize more columns
plt.figure(figsize=(20, 15))
sns.boxplot(data=numeric_df[['malaria', 'drowning', 'interpersonal_violence', 'maternal_disorders', 
                             'hiv/aids', 'drug_use_disorders', 'tuberculosis', 'cardiovascular_diseases', 
                             'lower_respiratory_infections', 'neonatal_disorders']])
plt.xticks(rotation=90)
plt.title("Boxplots of Additional Columns")
plt.show()

Random Forest Regressor model

In [648]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor 
from sklearn.metrics import mean_squared_error, r2_score

# Assuming you want to predict total number of deaths
X = numeric_df_capped_95.drop(columns=['total_no_of_deaths'])
y = numeric_df_capped_95['total_no_of_deaths']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Random Forest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 289725277.01695436
R^2 Score: 0.9938607881978867

Feature importances show which variables the fitted model relies on most when predicting the target.

In [649]:
importances = model.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
print(feature_importance_df.sort_values(by='importance', ascending=False))
                                       feature  importance
29                          digestive_diseases        0.87
30              fire,_heat,_and_hot_substances        0.06
12                     cardiovascular_diseases        0.02
24                                  poisonings        0.01
22                           diabetes_mellitus        0.01
16                                   self-harm        0.01
6                                     drowning        0.01
27                chronic_respiratory_diseases        0.00
14                          neonatal_disorders        0.00
13                lower_respiratory_infections        0.00
9                                     hiv/aids        0.00
7                       interpersonal_violence        0.00
20                                   neoplasms        0.00
18                          diarrheal_diseases        0.00
23                      chronic_kidney_disease        0.00
33                           cumulative_deaths        0.00
28  cirrhosis_and_other_chronic_liver_diseases        0.00
19        environmental_heat_and_cold_exposure        0.00
11                                tuberculosis        0.00
25                 protein-energy_malnutrition        0.00
15                       alcohol_use_disorders        0.00
4                     nutritional_deficiencies        0.00
5                                      malaria        0.00
26                               road_injuries        0.00
0                                         year        0.00
1                                   meningitis        0.00
31                             acute_hepatitis        0.00
10                          drug_use_disorders        0.00
2       alzheimers_disease_and_other_dementias        0.00
3                           parkinsons_disease        0.00
8                           maternal_disorders        0.00
21                      conflict_and_terrorism        0.00
32                           meningitis_change        0.00
17                exposure_to_forces_of_nature        0.00
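
Impurity-based importances from a random forest can concentrate on one of several correlated features, which may be part of why `digestive_diseases` dominates here. As a cross-check, permutation importance measures how much the test score drops when a single column is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data (the column names and values are made up); in the notebook, the fitted `model`, `X_test`, and `y_test` would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one informative column and one pure-noise column
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column several times and record the average drop in R^2
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm_df = pd.DataFrame({"feature": X_demo.columns, "importance": result.importances_mean})
print(perm_df.sort_values(by="importance", ascending=False))
```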

Cross-Validation:

To ensure that the model is robust, use cross-validation to evaluate its performance across multiple subsets of the data.

In [650]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f'Cross-validated R² scores: {cv_scores}')
print(f'Mean CV R² score: {cv_scores.mean()}')
Cross-validated R² scores: [0.98481804 0.91487192 0.95665849 0.977272   0.9252078 ]
Mean CV R² score: 0.9517656470976534
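
With an integer `cv`, `cross_val_score` uses unshuffled K-fold splitting for regressors, so if the rows are ordered (for example by country), each fold holds out a contiguous block. That may explain some of the spread between the fold scores above. A minimal sketch of passing a shuffled `KFold` instead, on synthetic data (`X_demo` and `y_demo` are placeholders for the notebook's `X` and `y`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Shuffle rows before splitting so each fold mixes the whole dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold, scoring="r2")
print(shuffled_scores, shuffled_scores.mean())
```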

Log Transformation (if the target is skewed):

If the target variable (total_no_of_deaths) is highly skewed, applying a log transformation can help reduce the impact of outliers.

In [ ]:
import numpy as np

y_log = np.log1p(y)  # Apply log transformation
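
Whether the transformation is worthwhile can be checked by comparing skewness before and after. The sketch below uses pandas' `Series.skew()` on a synthetic right-skewed series (`y_demo` is a stand-in for the notebook's `y`):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target standing in for total_no_of_deaths
rng = np.random.default_rng(7)
y_demo = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=2000))

skew_raw = y_demo.skew()
skew_log = np.log1p(y_demo).skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# np.expm1 reverses np.log1p, so predictions can be mapped back to counts
restored = np.expm1(np.log1p(y_demo))
```

A skewness close to 0 after the transform suggests the log scale is a reasonable choice for the target.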

Feature Selection:

Remove features with zero or negligible importance and re-run the model. This will reduce the dimensionality of the dataset and focus on the most impactful features.

In [ ]:
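# Dropping features with negligible importance. The table above is rounded to two
# decimals, so an exact == 0 test would keep features with values such as 0.003.
importance_threshold = 0.005  # assumed cutoff; adjust as needed
low_importance_features = feature_importance_df[feature_importance_df['importance'] < importance_threshold]['feature']
X_reduced = X.drop(columns=low_importance_features)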

# Re-run train-test split and model
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_log, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions and evaluation (the target is y_log, so the MSE below is in log units)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
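
Because the cell above trains on `y_log`, its MSE is in log units and cannot be compared directly with the earlier MSE in raw death counts. One option is scikit-learn's `TransformedTargetRegressor`, which applies `np.log1p` to the target during fitting and `np.expm1` to the predictions, so metrics come out on the original scale. A sketch on synthetic data (all names below are illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic positive, right-skewed target
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 5, size=(400, 2))
y_demo = np.exp(X_demo[:, 0] + 0.5 * X_demo[:, 1]) + rng.uniform(0, 1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The wrapper fits on log1p(y) and returns predictions already passed through expm1
wrapped = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_tr, y_tr)

y_hat = wrapped.predict(X_te)  # original scale
mse_raw = mean_squared_error(y_te, y_hat)
r2_raw = r2_score(y_te, y_hat)
print(f"MSE (original scale): {mse_raw:.2f}, R^2: {r2_raw:.3f}")
```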
In [ ]:
numeric_df.columns
In [ ]:
df.head()

OBSERVATIONS: China, India, and the USA bear the largest share of disease-related deaths in the world. Cardiovascular diseases, neoplasms (malignancy/cancer), and lower respiratory tract infections (for example, pneumonia) are the top 3 killer diseases in the world.


Transformations¶

Log Transformation¶

Log transformations can help reduce the effect of large outliers.

In [424]:
import numpy as np

# Apply log transformation to relevant columns
transformed_df = df.copy()
columns_to_transform = ['meningitis', 'alzheimers_disease_and_other_dementias', 
                        'parkinsons_disease', 'nutritional_deficiencies', 
                        'malaria', 'drowning', 'interpersonal_violence', 
                        'maternal_disorders', 'hiv/aids', 'chronic_kidney_disease', 
                        'poisonings', 'protein-energy_malnutrition', 'road_injuries', 
                        'chronic_respiratory_diseases', 'cirrhosis_and_other_chronic_liver_diseases', 
                        'digestive_diseases', 'fire,_heat,_and_hot_substances', 'acute_hepatitis']

for col in columns_to_transform:
    transformed_df[col] = np.log1p(transformed_df[col])  # log1p is used to handle zero values

transformed_df.describe()
Out[424]:
         year  meningitis  alzheimers_disease_and_other_dementias  \
count 6120.00     6120.00                                 6120.00
mean  2004.50        4.55                                    6.01
std      8.66        2.39                                    2.23
min   1990.00        0.00                                    0.00
25%   1997.00        2.77                                    4.51
50%   2004.50        4.70                                    6.50
75%   2012.00        6.74                                    7.81
max   2019.00        7.65                                    8.70

       parkinsons_disease  nutritional_deficiencies  malaria  drowning  \
count             6120.00                   6120.00  6120.00   6120.00
mean                 4.76                      4.55     2.29      4.91
std                  1.99                      2.57     2.96      1.99
min                  0.00                      0.00     0.00      0.00
25%                  3.33                      2.30     0.00      3.56
50%                  5.11                      4.79     0.00      5.18
75%                  6.41                      7.06     5.98      6.55
max                  7.30                      7.97     6.89      7.44

       interpersonal_violence  maternal_disorders  hiv/aids  ...  \
count                 6120.00             6120.00   6120.00  ...
mean                     5.13                4.05      4.86  ...
std                      2.10                2.53      2.71  ...
min                      0.00                0.00      0.00  ...
25%                      3.71                1.79      2.48  ...
50%                      5.58                4.01      4.92  ...
75%                      6.78                6.60      7.54  ...
max                      7.67                7.51      8.45  ...

       protein-energy_malnutrition  road_injuries  \
count                      6120.00        6120.00
mean                          4.30           6.39
std                           2.65           2.26
min                           0.00           0.00
25%                           1.79           5.17
50%                           4.53           6.87
75%                           6.95           8.14
max                           7.86           9.03

       chronic_respiratory_diseases  \
count                       6120.00
mean                           6.88
std                            2.16
min                            0.69
25%                            5.67
50%                            7.43
75%                            8.57
max                            9.45

       cirrhosis_and_other_chronic_liver_diseases  digestive_diseases  \
count                                     6120.00             6120.00
mean                                         6.44                7.04
std                                          2.25                2.23
min                                          0.00                0.00
25%                                          5.04                5.65
50%                                          7.10                7.69
75%                                          8.17                8.71
max                                          9.06                9.60

       fire,_heat,_and_hot_substances  acute_hepatitis  total_no_of_deaths  \
count                         6120.00          6120.00             6120.00
mean                             4.36             3.00           107820.60
std                              2.08             2.11           128801.21
min                              0.00             0.00                7.00
25%                              2.89             1.10             6935.00
50%                              4.84             2.77            50257.50
75%                              6.11             5.08           158221.00
max                              7.00             5.99           385150.00

       meningitis_change  cumulative_deaths
count            6120.00            6120.00
mean              -21.16         1483344.83
std               919.88         1879915.84
min            -10728.00              13.00
25%               -11.00           71995.25
50%                -1.00          553431.50
75%                 0.00         2266613.50
max             53333.00         5558540.88

8 rows × 35 columns
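
The comment in the cell above notes that `log1p` is used to handle zero values. A small check of what that means: `np.log1p(x)` computes log(1 + x), so a count of zero maps to 0 rather than negative infinity, and `np.expm1` inverts the transform. The `counts` array is made up for illustration.

```python
import numpy as np

counts = np.array([0.0, 1.0, 9.0, 999.0])

logged = np.log1p(counts)    # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

# Plain log is undefined at zero; silence the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(counts)

print(logged)
print(plain_log)
```

This also accounts for the 0.69 minimum in the table: it is log1p(1), the transform of a column whose smallest raw value is 1.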


Bonus: Predictive Analysis¶

For the bonus section, we will use a simple machine learning model to predict future deaths based on the historical data.

In [426]:
# Define features and target
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Linear Regression Model

In [427]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate the model
mae_train = mean_absolute_error(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

mae_test = mean_absolute_error(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

# Print evaluation metrics
print("Train Set Evaluation:")
print(f"MAE: {mae_train:.2f}, MSE: {mse_train:.2f}, R²: {r2_train:.2f}")

print("\nTest Set Evaluation:")
print(f"MAE: {mae_test:.2f}, MSE: {mse_test:.2f}, R²: {r2_test:.2f}")
Train Set Evaluation:
MAE: 12220.00, MSE: 368377986.99, R²: 0.98

Test Set Evaluation:
MAE: 13204.36, MSE: 494300444.41, R²: 0.97
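
MSE is expressed in squared deaths, which is hard to read next to MAE. Taking the square root gives the RMSE, in the same units as the target. The snippet below applies that to the test MSE printed above (the variable names are illustrative):

```python
import math

# Test-set MSE reported by the linear regression cell above
mse_test_reported = 494300444.41

# RMSE is in deaths, directly comparable with the MAE
rmse_test = math.sqrt(mse_test_reported)
print(f"RMSE: {rmse_test:.2f}")
```

An RMSE noticeably larger than the MAE indicates that a few large errors dominate the squared-error metric.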

Random Forest Regressor

In [428]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_train_rf = rf_model.predict(X_train)
y_pred_test_rf = rf_model.predict(X_test)

# Evaluate the model
mae_train_rf = mean_absolute_error(y_train, y_pred_train_rf)
mse_train_rf = mean_squared_error(y_train, y_pred_train_rf)
r2_train_rf = r2_score(y_train, y_pred_train_rf)

mae_test_rf = mean_absolute_error(y_test, y_pred_test_rf)
mse_test_rf = mean_squared_error(y_test, y_pred_test_rf)
r2_test_rf = r2_score(y_test, y_pred_test_rf)

# Print evaluation metrics
print("Random Forest Train Set Evaluation:")
print(f"MAE: {mae_train_rf:.2f}, MSE: {mse_train_rf:.2f}, R²: {r2_train_rf:.2f}")

print("\nRandom Forest Test Set Evaluation:")
print(f"MAE: {mae_test_rf:.2f}, MSE: {mse_test_rf:.2f}, R²: {r2_test_rf:.2f}")
Random Forest Train Set Evaluation:
MAE: 761.16, MSE: 5878023.11, R²: 1.00

Random Forest Test Set Evaluation:
MAE: 2436.79, MSE: 130860626.72, R²: 0.99

Polynomial Regression

In [429]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create polynomial features (e.g., degree 2 for quadratic regression)
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

# Initialize and train the model
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)

# Predict on the training and test sets
y_poly_train_pred = poly_model.predict(X_poly_train)
y_poly_test_pred = poly_model.predict(X_poly_test)

# Evaluate the model
mse_poly_train = mean_squared_error(y_train, y_poly_train_pred)
r2_poly_train = r2_score(y_train, y_poly_train_pred)

mse_poly_test = mean_squared_error(y_test, y_poly_test_pred)
r2_poly_test = r2_score(y_test, y_poly_test_pred)

# Print evaluation metrics
print("Polynomial Regression Train Set Evaluation:")
print(f"MAE: {mean_absolute_error(y_train, y_poly_train_pred):.2f}, MSE: {mse_poly_train:.2f}, R²: {r2_poly_train:.2f}")

print("\nPolynomial Regression Test Set Evaluation:")
print(f"MAE: {mean_absolute_error(y_test, y_poly_test_pred):.2f}, MSE: {mse_poly_test:.2f}, R²: {r2_poly_test:.2f}")
Polynomial Regression Train Set Evaluation:
MAE: 2752.13, MSE: 26848266.57, R²: 1.00

Polynomial Regression Test Set Evaluation:
MAE: 3968.86, MSE: 113650941.75, R²: 0.99
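
Degree-2 expansion grows quickly with the number of input columns: for n features it produces (n + 1)(n + 2) / 2 columns, including the bias term. That is worth keeping in mind when reading the near-perfect training score above. A quick check on a placeholder array (34 columns is an assumption chosen for illustration, not the actual width of the notebook's `X_train`):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 34                      # assumed input width, for illustration
X_demo = np.zeros((5, n_features))   # values do not matter for counting columns

poly_demo = PolynomialFeatures(degree=2)
n_out = poly_demo.fit_transform(X_demo).shape[1]

# Closed-form count of monomials of degree <= 2 in n variables
expected = (n_features + 1) * (n_features + 2) // 2
print(n_out, expected)
```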

Time Series Modeling with ARIMA

In [430]:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Prepare the data for ARIMA model (set 'year' as index)
df.set_index('year', inplace=True)

# Train ARIMA model (order can be adjusted based on data)
arima_model = ARIMA(df['total_no_of_deaths'], order=(1, 1, 1))  # You can tune (p,d,q) parameters
arima_model_fit = arima_model.fit()

# Print model summary
print(arima_model_fit.summary())

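# Make predictions
# Caveat: df stacks every country's rows and y_test comes from the earlier random
# split, so these forecasts are not aligned in time with y_test
y_pred_arima = arima_model_fit.forecast(steps=len(X_test))  # Forecast as many steps as there are test rows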

# Evaluate ARIMA model
mse_arima = mean_squared_error(y_test, y_pred_arima)
r2_arima = r2_score(y_test, y_pred_arima)

print("\nARIMA Test Set Evaluation:")
print(f"MSE: {mse_arima:.2f}, R²: {r2_arima:.2f}")
D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:

An unsupported index was provided and will be ignored when e.g. forecasting.

                               SARIMAX Results                                
==============================================================================
Dep. Variable:     total_no_of_deaths   No. Observations:                 6120
Model:                 ARIMA(1, 1, 1)   Log Likelihood              -72382.157
Date:                Tue, 17 Sep 2024   AIC                         144770.314
Time:                        23:26:41   BIC                         144790.472
Sample:                             0   HQIC                        144777.307
                               - 6120                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.1644      3.681     -0.045      0.964      -7.379       7.050
ma.L1          0.1456      3.682      0.040      0.968      -7.070       7.362
sigma2      1.103e+09   1.99e-07   5.55e+15      0.000     1.1e+09     1.1e+09
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):           1866416.09
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):               0.94   Skew:                            -1.55
Prob(H) (two-sided):                  0.18   Kurtosis:                        88.50
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 4.88e+30. Standard errors may be unstable.

ARIMA Test Set Evaluation:
MSE: 17168807798.80, R²: -0.01
D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning:

No supported index is available. Prediction results will be given with an integer index beginning at `start`.

D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: FutureWarning:

No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
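
ARIMA expects a single ordered series, but `df` here stacks every country's rows and `y_test` comes from a random split, which likely explains both the R² near zero and the index warnings. One way to prepare the data is to aggregate to one value per year, give the series a proper date index, and hold out the final years. The sketch below shows only that preparation step on synthetic data (`panel`, its column names, and the five-year holdout are assumptions); the resulting `train` series is what would then be passed to `ARIMA`.

```python
import numpy as np
import pandas as pd

# Synthetic country-year panel standing in for df
rng = np.random.default_rng(5)
years = np.arange(1990, 2020)
panel = pd.DataFrame({
    "country": np.repeat(["A", "B", "C"], len(years)),
    "year": np.tile(years, 3),
    "total_no_of_deaths": rng.integers(1000, 5000, size=3 * len(years)),
})

# One observation per year: sum deaths over all countries
yearly = panel.groupby("year")["total_no_of_deaths"].sum().sort_index()

# Replace the integer years with a DatetimeIndex (1 January of each year)
yearly.index = pd.to_datetime(yearly.index.astype(str), format="%Y")

# Chronological split: the last `holdout` years form the test set
holdout = 5
train, test = yearly.iloc[:-holdout], yearly.iloc[-holdout:]
print(train.tail(3))
print(test)
```

Forecasting `len(test)` steps from a model fitted on `train` then yields predictions that line up with the held-out years.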


Feature Engineering¶

In [435]:
# Load the dataset
df = pd.read_csv('cause_of_deaths.csv')
In [440]:
cause_of_deaths = ['Meningitis',
       'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
In [441]:
# Creating a new column for 'Total_no_of_Deaths' for individual Country and Year

df['Total_no_of_Deaths'] = df[cause_of_deaths].sum(axis=1)
In [437]:
df['year_squared'] = df['Year'] ** 2

year2 = df.sort_values(by='year_squared',ascending=False)[:10][['year_squared','Year']]

year2
Out[437]:
year_squared Year
6119 4076361 2019
1649 4076361 2019
5489 4076361 2019
4349 4076361 2019
2729 4076361 2019
2039 4076361 2019
4379 4076361 2019
659 4076361 2019
5459 4076361 2019
4409 4076361 2019
In [438]:
# 1. Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['Year']])
In [442]:
# 2. Adding Log Transformation
df['log_total_no_of_deaths'] = np.log1p(df['Total_no_of_Deaths'])
In [443]:
# 3. Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
In [444]:
# Prepare the features and target variable
X = pd.DataFrame(X_scaled, columns=poly.get_feature_names_out())
y = df['log_total_no_of_deaths']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [445]:
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Linear Regression Model with Feature Engineering:")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
Linear Regression Model with Feature Engineering:
Mean Squared Error: 5.948201145565115
R^2 Score: 0.00042561720103073686
In [447]:
# Plotting predictions (both series converted back from the log scale)
plt.scatter(X_test['Year'], np.expm1(y_test), color='blue', label='Actual')
plt.scatter(X_test['Year'], np.expm1(y_pred), color='red', label='Predicted')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()

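MSE Interpretation: The Mean Squared Error (about 5.95) is lower than in the previous models only because it is measured on the log-transformed target; the earlier values were in raw death counts, so the two are not comparable. The R² of roughly 0.0004 indicates that Year and its square explain almost none of the variation in total deaths across countries.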
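
For context when reading these scores, R² compares a model with a baseline that always predicts the mean of the target, so a score of 0 means no gain over that baseline. An MSE computed on a log-transformed target is also in log units, which puts it on a different scale from an MSE on raw counts. A small illustration with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 1000.0, 10000.0, 100000.0])

# Baseline that always predicts the mean: R^2 is 0 by definition
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

# The same predictions scored on the raw scale and on the log scale
y_pred = y_true * 1.5
mse_raw = mean_squared_error(y_true, y_pred)
mse_log = mean_squared_error(np.log1p(y_true), np.log1p(y_pred))

print(f"R^2 of mean baseline: {r2_baseline:.4f}")
print(f"MSE raw: {mse_raw:.2f}, MSE log: {mse_log:.4f}")
```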

Breakdown of Feature Engineering Steps¶
Polynomial Features:¶

Adding polynomial features allows the model to capture non-linear relationships between the features and the target variable. Here, a quadratic transformation (degree=2) of the Year feature is used.

Log Transformation:

Applying a log transformation to the target variable helps stabilize the variance and reduce skewness. The transformation np.log1p(df['Total_no_of_Deaths']) is useful when the data spans a large range or contains outliers.

Standardization:

Standardizing the features (StandardScaler) gives each feature a mean of 0 and a standard deviation of 1. This is especially useful with polynomial features, whose values can differ by orders of magnitude (for example, Year versus Year squared).
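
These steps can also be chained in a scikit-learn pipeline, so that the scaler is fitted on the training split only and the same transformations are reused at prediction time. A sketch on synthetic yearly data (the `demo` frame is made up; its column names mirror the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a clear upward trend over the years
rng = np.random.default_rng(11)
demo = pd.DataFrame({"Year": rng.integers(1990, 2020, size=300)})
demo["Total_no_of_Deaths"] = 1000 + 50 * (demo["Year"] - 1990) + rng.normal(scale=20, size=300)

X_demo = demo[["Year"]]
y_demo = np.log1p(demo["Total_no_of_Deaths"])  # log-transformed target
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Polynomial expansion -> standardization -> linear regression, fitted on train only
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(X_tr, y_tr)

# Convert predictions back from the log scale to death counts
pred_deaths = np.expm1(pipe.predict(X_te))
print(f"Test R^2 (log scale): {pipe.score(X_te, y_te):.3f}")
```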

In [448]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Regression Model:")
print(f"Mean Squared Error: {mse_rf}")
print(f"R^2 Score: {r2_rf}")
Random Forest Regression Model:
Mean Squared Error: 5.976878671752126
R^2 Score: -0.0043935406984889624
In [449]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Example for Ridge Regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print("Tuned Ridge Regression Model:")
print(f"Mean Squared Error: {mse_best}")
print(f"R^2 Score: {r2_best}")
Best parameters found:  {'alpha': 0.1}
Tuned Ridge Regression Model:
Mean Squared Error: 5.944337099032542
R^2 Score: 0.0010749566960110979
In [450]:
importances = rf_model.feature_importances_
feature_names = X.columns
sorted_indices = importances.argsort()[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[sorted_indices], align="center")
plt.xticks(range(X.shape[1]), feature_names[sorted_indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()
In [453]:
# Adding polynomial features and log transformation
df['year_squared'] = df['Year'] ** 2
df['log_total_no_of_deaths'] = np.log1p(df['Total_no_of_Deaths'])

# Prepare features and target variable
X = df[['Year', 'year_squared']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X = pd.DataFrame(X_scaled, columns=['Year', 'year_squared'])
y = df['log_total_no_of_deaths']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Linear Regression Model with Feature Engineering:")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

# Plot predictions
plt.scatter(X_test['Year'], np.expm1(y_test), color='blue', label='Actual')
plt.scatter(X_test['Year'], np.expm1(y_pred), color='red', label='Predicted')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Linear Regression Model with Feature Engineering:
Mean Squared Error: 5.948201145565117
R^2 Score: 0.00042561720103029277
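
An R^2 close to zero is plausible here rather than a sign of a coding error: every country contributes one row per year, so `Year` alone cannot separate small countries from large ones. The sketch below illustrates this on synthetic data. The three country baselines and the 0.01 yearly growth in log deaths are assumptions made up for the illustration, not values from the dataset, and `r2_least_squares` is a hypothetical helper.

```python
import numpy as np

def r2_least_squares(X, y):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals.var() / y.var()

years = np.arange(1990, 2020)
# Hypothetical baselines (log deaths) for three countries of very different size
baselines = np.array([6.0, 10.0, 14.0])

# One row per (country, year); log deaths grow by 0.01 per year in every country
year_col = np.tile(years, len(baselines)).astype(float)
country_col = np.repeat(np.arange(len(baselines)), len(years))
log_deaths = baselines[country_col] + 0.01 * (year_col - 1990)

# Model 1: Year only (centred for numerical stability)
X_year = (year_col - year_col.mean()).reshape(-1, 1)
r2_year_only = r2_least_squares(X_year, log_deaths)

# Model 2: Year plus one-hot country indicators (first country is the reference)
one_hot = np.eye(len(baselines))[country_col][:, 1:]
X_full = np.column_stack([X_year, one_hot])
r2_with_country = r2_least_squares(X_full, log_deaths)

print(f"R^2, Year only:      {r2_year_only:.4f}")
print(f"R^2, Year + country: {r2_with_country:.4f}")
```

In this toy setting almost all of the variance sits between countries, so the year-only model explains very little of it even though the trend within each country is perfectly linear.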
In [454]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Grid search for Ridge Regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print("Tuned Ridge Regression Model:")
print(f"Mean Squared Error: {mse_best}")
print(f"R^2 Score: {r2_best}")
Best parameters found:  {'alpha': 0.1}
Tuned Ridge Regression Model:
Mean Squared Error: 5.944337099032542
R^2 Score: 0.0010749566960110979

Ridge Regression: Regularized model with hyperparameter tuning.
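
To make the role of `alpha` concrete, here is a minimal sketch of the closed-form ridge solution, w = (XᵀX + alpha·I)⁻¹ Xᵀy, on synthetic data. It is a NumPy illustration rather than scikit-learn's actual solver, `ridge_coefficients` is a hypothetical helper, and the features and target are centred so that no intercept is needed.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution for centred X and y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(42)
# Two nearly collinear synthetic features, similar in spirit to Year and Year**2
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X_demo = np.column_stack([x1, x2])
X_demo -= X_demo.mean(axis=0)
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)
y_demo -= y_demo.mean()

norms = []
for alpha in [0.1, 1.0, 10.0]:
    w = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>4}: coefficients={w}, L2 norm={norms[-1]:.3f}")
```

Larger values of `alpha` shrink the coefficient vector towards zero, which is why the grid search above compares several values.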

In [455]:
df.head().T
Out[455]:
0 1 2 3 4
Country/Territory Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan
Code AFG AFG AFG AFG AFG
Year 1990 1991 1992 1993 1994
Meningitis 2159 2218 2475 2812 3027
Alzheimer's Disease and Other Dementias 1116 1136 1162 1187 1211
Parkinson's Disease 371 374 378 384 391
Nutritional Deficiencies 2087 2153 2441 2837 3081
Malaria 93 189 239 108 211
Drowning 1370 1391 1514 1687 1809
Interpersonal Violence 1538 2001 2299 2589 2849
Maternal Disorders 2655 2885 3315 3671 3863
HIV/AIDS 34 41 48 56 63
Drug Use Disorders 93 102 118 132 142
Tuberculosis 4661 4743 4976 5254 5470
Cardiovascular Diseases 44899 45492 46557 47951 49308
Lower Respiratory Infections 23741 24504 27404 31116 33390
Neonatal Disorders 15612 17128 20060 22335 23288
Alcohol Use Disorders 72 75 80 85 88
Self-harm 696 751 855 943 993
Exposure to Forces of Nature 0 1347 614 225 160
Diarrheal Diseases 4235 4927 6123 8174 8215
Environmental Heat and Cold Exposure 175 113 38 41 44
Neoplasms 11580 11796 12218 12634 12914
Conflict and Terrorism 1490 3370 4344 4096 8959
Diabetes Mellitus 2108 2120 2153 2195 2231
Chronic Kidney Disease 3709 3724 3776 3862 3932
Poisonings 338 351 386 425 451
Protein-Energy Malnutrition 2054 2119 2404 2797 3038
Road Injuries 4154 4472 5106 5681 6001
Chronic Respiratory Diseases 5945 6050 6223 6445 6664
Cirrhosis and Other Chronic Liver Diseases 2673 2728 2830 2943 3027
Digestive Diseases 5005 5120 5335 5568 5739
Fire, Heat, and Hot Substances 323 332 360 396 420
Acute Hepatitis 2985 3092 3325 3601 3816
year_squared 3960100 3964081 3968064 3972049 3976036
Total_no_of_Deaths 147971 156844 169156 182230 194795
log_total_no_of_deaths 11.90 11.96 12.04 12.11 12.18
In [459]:
# Using global death trends for prediction
X = df[['Year']]
y = df['Total_no_of_Deaths']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [457]:
from sklearn.ensemble import GradientBoostingRegressor

# Total_no_of_Deaths is a continuous count, so a regressor is used rather than a classifier
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
In [460]:
from sklearn.model_selection import cross_val_score

# For regressors, cross_val_score reports R^2 by default
cv_scores = cross_val_score(gb_model, X, y, cv=5)
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean()}')
In [ ]:
from sklearn.metrics import mean_squared_error, r2_score

# Regression metrics replace the confusion matrix, which applies only to class labels
print('Mean Squared Error:', mean_squared_error(y_test, gb_predictions))
print('R^2 Score:', r2_score(y_test, gb_predictions))
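
For reference, the k-fold cross-validation used above partitions the rows into k folds and, in turn, holds each fold out for scoring while fitting on the rest. The sketch below spells this out with NumPy and a straight-line fit on synthetic data. The helper name `kfold_r2` is hypothetical, and this is an illustration of the idea, not how scikit-learn implements `cross_val_score` internally.

```python
import numpy as np

def kfold_r2(x, y, k=5, seed=0):
    """Return the held-out R^2 of a straight-line fit for each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit y = slope * x + intercept on the training folds only
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = slope * x[test_idx] + intercept
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

rng = np.random.default_rng(1)
x_demo = np.linspace(0.0, 29.0, 120)  # e.g. years since 1990
y_demo = 2.0 * x_demo + rng.normal(scale=3.0, size=x_demo.size)
scores = kfold_r2(x_demo, y_demo, k=5)
print("Fold scores:", scores)
print("Mean score:", scores.mean())
```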

EXTRA WORK¶

In [461]:
from IPython.display import Image, display

# Display an image from a URL
image_url = 'https://th.bing.com/th/id/OIP.MRqC4PFoXLAaZz4nzDowiQHaE8?rs=1&pid=ImgDetMain'
display(Image(url=image_url))

Since I live in Egypt, I will focus on analysing the data for my country¶

In [462]:
df=pd.read_csv("cause_of_deaths.csv")
In [463]:
# Create a new data frame for Egypt

Egypt_df = df[df['Country/Territory'] == 'Egypt']

Egypt_df.head()
Out[463]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Diabetes Mellitus Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis
1620 Egypt EGY 2008 1138 5212 1732 886 0 1025 503 ... 13100 15535 175 736 25929 18612 48127 53614 1378 1689
1621 Egypt EGY 2009 1137 5340 1796 889 0 1051 535 ... 13951 16199 180 734 26837 19088 49630 55290 1415 1657
1622 Egypt EGY 2010 1111 5464 1847 874 0 1070 535 ... 14715 16806 183 718 27409 19409 50816 56588 1447 1605
1623 Egypt EGY 2011 1115 5607 1898 871 0 1093 562 ... 15183 17195 186 713 28205 19685 51945 57828 1462 1568
1624 Egypt EGY 2014 925 5999 2053 842 0 1051 593 ... 16886 18893 183 677 28405 20849 55221 61427 1474 1443

5 rows × 34 columns

Top 10 causes of death in Egypt¶

In [464]:
cause_of_deaths = ['Meningitis',
       'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
Egypt_df = Egypt_df.copy()

# Create a 'Total_no_of_Deaths' column: the sum of all causes for each year
Egypt_df['Total_no_of_Deaths'] = Egypt_df[cause_of_deaths].sum(axis=1)
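
Filtering a DataFrame and then assigning a new column to the result can trigger pandas' `SettingWithCopyWarning`; taking an explicit `.copy()` of the filtered rows first avoids it. A minimal sketch with a small hypothetical frame (the two cause columns and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only
df_demo = pd.DataFrame({
    "Country/Territory": ["Egypt", "Egypt", "Chile"],
    "Year": [2018, 2019, 2019],
    "Cause A": [10, 12, 7],
    "Cause B": [5, 6, 3],
})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous
egypt_demo = df_demo[df_demo["Country/Territory"] == "Egypt"].copy()
egypt_demo["Total"] = egypt_demo[["Cause A", "Cause B"]].sum(axis=1)

print(egypt_demo)
print("Original columns unchanged:", list(df_demo.columns))
```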

In [465]:
# Find the total number of deaths from each cause in Egypt

EG_disease = Egypt_df[cause_of_deaths].sum().to_frame().reset_index()
EG_disease.rename(columns = {'index': 'Diseases', 0:'Total_no_of_Deaths'}, inplace = True)
EG_disease
Out[465]:
Diseases Total_no_of_Deaths
0 Meningitis 39101
1 Alzheimer's Disease and Other Dementias 146785
2 Parkinson's Disease 49207
3 Nutritional Deficiencies 30709
4 Malaria 0
5 Drowning 33681
6 Interpersonal Violence 11933
7 Maternal Disorders 34917
8 HIV/AIDS 2784
9 Drug Use Disorders 1366
10 Tuberculosis 35759
11 Cardiovascular Diseases 5995471
12 Lower Respiratory Infections 954868
13 Neonatal Disorders 504806
14 Alcohol Use Disorders 3349
15 Self-harm 70777
16 Exposure to Forces of Nature 1572
17 Diarrheal Diseases 498193
18 Environmental Heat and Cold Exposure 1735
19 Neoplasms 1160639
20 Conflict and Terrorism 7542
21 Diabetes Mellitus 370494
22 Chronic Kidney Disease 445949
23 Poisonings 5649
24 Protein-Energy Malnutrition 26381
25 Road Injuries 796157
26 Chronic Respiratory Diseases 543660
27 Cirrhosis and Other Chronic Liver Diseases 1422257
28 Digestive Diseases 1583081
29 Fire, Heat, and Hot Substances 42655
30 Acute Hepatitis 56882
In [466]:
# Find the top 10 causes of death in Egypt

Top10_EG_diseases = EG_disease.sort_values(by='Total_no_of_Deaths',ascending = False).head(10)

Top10_EG_diseases
Out[466]:
Diseases Total_no_of_Deaths
11 Cardiovascular Diseases 5995471
28 Digestive Diseases 1583081
27 Cirrhosis and Other Chronic Liver Diseases 1422257
19 Neoplasms 1160639
12 Lower Respiratory Infections 954868
25 Road Injuries 796157
26 Chronic Respiratory Diseases 543660
13 Neonatal Disorders 504806
17 Diarrheal Diseases 498193
22 Chronic Kidney Disease 445949
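
As a side note, `sort_values(...).head(10)` can also be written with `nlargest`, which states the intent directly. A small sketch using five of the totals from the `EG_disease` output above (the frame name `totals_demo` is hypothetical):

```python
import pandas as pd

# Five rows copied from the EG_disease output above
totals_demo = pd.DataFrame({
    "Diseases": ["Meningitis", "Cardiovascular Diseases", "Neoplasms",
                 "Malaria", "Digestive Diseases"],
    "Total_no_of_Deaths": [39101, 5995471, 1160639, 0, 1583081],
})

# nlargest keeps the n rows with the highest values, sorted in descending order
top3 = totals_demo.nlargest(3, "Total_no_of_Deaths")
print(top3)
```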
In [467]:
# Create a bar chart of the top 10 causes of death in Egypt

plt.figure(figsize=(12,8))

sns.barplot(data = Top10_EG_diseases, x = 'Total_no_of_Deaths', y = 'Diseases', color = 'Blue')

# Add some text for labels, title 
plt.xlabel('Total Number of Deaths', fontsize = 15)
plt.ylabel('Diseases', fontsize = 15)
plt.title('Top 10 Causes of Death in Egypt during 1990-2019', fontsize =15)
Out[467]:
Text(0.5, 1.0, 'Top 10 Causes of Death in Egypt during 1990-2019')
In [468]:
# Create a treemap (requires plotly.express)
import plotly.express as px

fig = px.treemap(EG_disease, 
                 path = [px.Constant('Total_no_of_Deaths'), 'Diseases'], 
                 values = 'Total_no_of_Deaths'
                 )

fig.update_traces(textinfo='label+percent parent')    
fig.update_layout(title_text='Percentage of deaths by cause in Egypt during 1990-2019', title_x=0.5, font_size=15)
fig.show()

Time Series of total number of deaths in Egypt¶

In [469]:
Egypt_df.columns
Out[469]:
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
       'Total_no_of_Deaths'],
      dtype='object')
In [470]:
# Find the total number of deaths in Egypt, grouped by year

EG_Deaths_by_year = Egypt_df.groupby('Year')['Total_no_of_Deaths'].sum().reset_index()

EG_Deaths_by_year
Out[470]:
Year Total_no_of_Deaths
0 1990 468409
1 1991 457493
2 1992 447129
3 1993 448294
4 1994 448394
5 1995 438552
6 1996 432618
7 1997 435796
8 1998 436987
9 1999 438946
10 2000 429426
11 2001 448267
12 2002 463366
13 2003 481773
14 2004 483542
15 2005 481658
16 2006 485863
17 2007 488266
18 2008 499896
19 2009 514724
20 2010 523236
21 2011 529436
22 2012 544796
23 2013 540252
24 2014 551643
25 2015 581847
26 2016 587922
27 2017 588017
28 2018 596617
29 2019 605194
In [471]:
# Create line chart

plt.figure(figsize=(12,6))

sns.lineplot(data = EG_Deaths_by_year, x='Year', y = 'Total_no_of_Deaths')

plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of Total Number of Deaths in EGYPT', fontsize=15)
Out[471]:
Text(0.5, 1.0, 'Time Series of Total Number of Deaths in EGYPT')
In [472]:
# Create a time series of the top 5 causes of death in Egypt
# (the five largest totals in Top10_EG_diseases above)

top5_diseases = ["Cardiovascular Diseases", 
                 "Digestive Diseases", 
                 "Cirrhosis and Other Chronic Liver Diseases", 
                 "Neoplasms", 
                 "Lower Respiratory Infections"]

plt.figure(figsize=(12,8))

for i in top5_diseases:
    sns.lineplot(data = Egypt_df, 
                 x = 'Year', 
                 y = i,
                 label = i
                )
    
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of Top 5 Causes of Death in Egypt', fontsize=15)
Out[472]:
Text(0.5, 1.0, 'Time Series of Top 5 Causes of Death in Egypt')

Causes of Death in Egypt in 2019¶

The latest year in this dataset is 2019, so this section looks at the most recent causes of death in Egypt.

In [473]:
Egypt_df.tail()
Out[473]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
1645 Egypt EGY 2012 1102 5746 1989 880 0 1104 618 ... 18053 189 717 28813 20395 53786 59861 1497 1542 544796
1646 Egypt EGY 2013 1082 5816 1964 844 0 1095 684 ... 18037 187 684 28515 20325 53648 59739 1481 1481 540252
1647 Egypt EGY 2017 823 6469 2296 826 0 1005 522 ... 20956 182 650 29308 21957 59798 66243 1494 1397 588017
1648 Egypt EGY 2018 790 6681 2366 817 0 988 516 ... 21461 180 639 29391 22235 61156 67659 1492 1373 596617
1649 Egypt EGY 2019 764 6918 2439 816 0 972 512 ... 21981 179 634 29490 22560 62635 69216 1496 1356 605194

5 rows × 35 columns

In [474]:
# Create a new data frame of Egypt year 2019

EG_2019 = Egypt_df[Egypt_df['Year'] == 2019]

EG_2019
Out[474]:
Country/Territory Code Year Meningitis Alzheimer's Disease and Other Dementias Parkinson's Disease Nutritional Deficiencies Malaria Drowning Interpersonal Violence ... Chronic Kidney Disease Poisonings Protein-Energy Malnutrition Road Injuries Chronic Respiratory Diseases Cirrhosis and Other Chronic Liver Diseases Digestive Diseases Fire, Heat, and Hot Substances Acute Hepatitis Total_no_of_Deaths
1649 Egypt EGY 2019 764 6918 2439 816 0 972 512 ... 21981 179 634 29490 22560 62635 69216 1496 1356 605194

1 rows × 35 columns
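
Hard-coding 2019 works for this file. A hedged alternative is to look up the latest year from the data, so the filter keeps working if newer rows are added. The sketch below uses three of the yearly totals shown earlier; the frame name `demo` is hypothetical.

```python
import pandas as pd

# Three rows taken from the yearly totals for Egypt shown earlier
demo = pd.DataFrame({
    "Country/Territory": ["Egypt", "Egypt", "Egypt"],
    "Year": [2017, 2019, 2018],
    "Total_no_of_Deaths": [588017, 605194, 596617],
})

# Look up the most recent year instead of hard-coding it
latest_year = demo["Year"].max()
latest_rows = demo[demo["Year"] == latest_year]
print(latest_year)
print(latest_rows)
```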

In [475]:
# Find the total number of deaths from each cause in Egypt in 2019

disease_2019 = EG_2019[cause_of_deaths].sum().to_frame().reset_index()
disease_2019.rename(columns={'index': 'Diseases', 0:'Total_deaths'}, inplace=True)
disease_2019
Out[475]:
Diseases Total_deaths
0 Meningitis 764
1 Alzheimer's Disease and Other Dementias 6918
2 Parkinson's Disease 2439
3 Nutritional Deficiencies 816
4 Malaria 0
5 Drowning 972
6 Interpersonal Violence 512
7 Maternal Disorders 751
8 HIV/AIDS 56
9 Drug Use Disorders 82
10 Tuberculosis 892
11 Cardiovascular Diseases 263873
12 Lower Respiratory Infections 21371
13 Neonatal Disorders 5336
14 Alcohol Use Disorders 150
15 Self-harm 3105
16 Exposure to Forces of Nature 0
17 Diarrheal Diseases 8474
18 Environmental Heat and Cold Exposure 42
19 Neoplasms 57934
20 Conflict and Terrorism 682
21 Diabetes Mellitus 20478
22 Chronic Kidney Disease 21981
23 Poisonings 179
24 Protein-Energy Malnutrition 634
25 Road Injuries 29490
26 Chronic Respiratory Diseases 22560
27 Cirrhosis and Other Chronic Liver Diseases 62635
28 Digestive Diseases 69216
29 Fire, Heat, and Hot Substances 1496
30 Acute Hepatitis 1356

Top 5 causes of death in Egypt in 2019¶

In [476]:
# Find the top 5 causes of death in Egypt in 2019

top5_2019 = disease_2019.groupby('Diseases')['Total_deaths'].sum().sort_values(ascending=False).head(5).reset_index()

top5_2019
Out[476]:
Diseases Total_deaths
0 Cardiovascular Diseases 263873
1 Digestive Diseases 69216
2 Cirrhosis and Other Chronic Liver Diseases 62635
3 Neoplasms 57934
4 Road Injuries 29490
In [477]:
# Create a bar chart of the top 5 causes of death in Egypt in 2019

plt.figure(figsize=(12,6))

sns.barplot(data = top5_2019, x = 'Total_deaths', y = 'Diseases', color = 'Blue')

plt.xlabel('Total Number of Deaths', fontsize = 12)
plt.ylabel('Cause of Deaths', fontsize = 12)
plt.title('Top 5 Causes of Death in Egypt in 2019', fontsize =15)
Out[477]:
Text(0.5, 1.0, 'Top 5 Causes of Death in Egypt in 2019')
In [478]:
# Create a pie chart of the top 5 causes (percentages are shares within these five causes)

fig, ax = plt.subplots()

ax.pie(top5_2019['Total_deaths'], labels= top5_2019['Diseases'], autopct='%1.1f%%')

ax.set_title('Top 5 Causes of Death in Egypt in 2019', fontsize =15)
Out[478]:
Text(0.5, 1.0, 'Top 5 Causes of Death in Egypt in 2019')
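
Note that a pie chart built from only the top five rows shows each cause's share of those five, not of all deaths. The arithmetic below uses the 2019 figures from the outputs above to compare the two readings for cardiovascular diseases. The total used is `Total_no_of_Deaths` as computed in this notebook.

```python
# 2019 figures for Egypt, copied from the outputs above
top5_2019_counts = {
    "Cardiovascular Diseases": 263873,
    "Digestive Diseases": 69216,
    "Cirrhosis and Other Chronic Liver Diseases": 62635,
    "Neoplasms": 57934,
    "Road Injuries": 29490,
}
total_all_causes_2019 = 605194  # Total_no_of_Deaths for Egypt in 2019

top5_sum = sum(top5_2019_counts.values())
cvd = top5_2019_counts["Cardiovascular Diseases"]

share_within_top5 = cvd / top5_sum
share_of_all_deaths = cvd / total_all_causes_2019

print(f"Share within the top 5 (what the pie shows): {share_within_top5:.1%}")
print(f"Share of all recorded deaths:                {share_of_all_deaths:.1%}")
```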
In [479]:
# Create Treemap

fig = px.treemap(disease_2019, 
                 path = [px.Constant('Total_deaths'), 'Diseases'], 
                 values = 'Total_deaths'
                 )

fig.update_traces(textinfo='label+percent parent')    
fig.update_layout(title_text='Percentage of Deaths by Cause in Egypt in 2019', title_x=0.5, font_size=15)
fig.show()

Time series of causes not related to disease in Egypt

I excluded the 'Road Injuries' and 'Self-harm' columns because their values are far larger than those of the other columns; including them would stretch the y-axis and make the remaining lines hard to read.
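
The same selection can be made from the data instead of by eye. The sketch below uses the 2019 values from the `disease_2019` output above and a hypothetical cutoff of 2,000 deaths to decide which series can share a readable axis.

```python
# 2019 values for Egypt, copied from the disease_2019 output above
deaths_2019 = {
    "Drowning": 972,
    "Interpersonal Violence": 512,
    "Drug Use Disorders": 82,
    "Alcohol Use Disorders": 150,
    "Environmental Heat and Cold Exposure": 42,
    "Fire, Heat, and Hot Substances": 1496,
    "Poisonings": 179,
    "Road Injuries": 29490,
    "Self-harm": 3105,
}

CUTOFF = 2000  # hypothetical threshold, chosen only for this illustration

plot_together = [name for name, count in deaths_2019.items() if count <= CUTOFF]
plot_separately = [name for name, count in deaths_2019.items() if count > CUTOFF]

print("Plot together:", plot_together)
print("Plot separately:", plot_separately)
```

Another option, not used in this notebook, would be to keep every series and switch the y-axis to a logarithmic scale.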

In [480]:
interest_data = ['Drowning', 
                 'Interpersonal Violence', 
                 'Drug Use Disorders',
                 'Alcohol Use Disorders',
                 'Environmental Heat and Cold Exposure', 
                 'Fire, Heat, and Hot Substances',
                 'Poisonings']

plt.figure(figsize=(16,9))

for i in interest_data:
    sns.lineplot(data = Egypt_df, 
                 x = 'Year', 
                 y = Egypt_df[i],
                 label = i
                )
    
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of Data Not Related to Disease in EGYPT', fontsize=15)
Out[480]:
Text(0.5, 1.0, 'Time Series of Data Not Related to Disease in EGYPT')
In [481]:
# Bar Chart: Adding labels on bars for total deaths
plt.figure(figsize=(12,8))
sns.barplot(data=Top10_EG_diseases, x='Total_no_of_Deaths', y='Diseases', color='Blue')

# Add labels on bars
for i in range(Top10_EG_diseases.shape[0]):
    plt.text(Top10_EG_diseases['Total_no_of_Deaths'].values[i], i, f'{Top10_EG_diseases["Total_no_of_Deaths"].values[i]:,}', va='center')

# Add some text for labels and title
plt.xlabel('Total Number of Deaths', fontsize=15)
plt.ylabel('Diseases', fontsize=15)
plt.title('Top 10 Causes of Deaths in Egypt (1990-2019)', fontsize=15)
Out[481]:
Text(0.5, 1.0, 'Top 10 Causes of Deaths in Egypt (1990-2019)')
In [482]:
# Time Series of Top 5 Causes
plt.figure(figsize=(12,8))
for i in top5_diseases:
    sns.lineplot(data=Egypt_df, x='Year', y=Egypt_df[i], label=i, marker='o', markersize=5)

plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Number of Deaths', fontsize=12)
plt.title('Time Series of Top 5 Causes of Deaths in Egypt', fontsize=15)
plt.legend(loc='upper left')
plt.grid(True)
In [483]:
# Cause of deaths in Egypt in 2019
Egypt_2019 = Egypt_df[Egypt_df['Year'] == 2019]
latest_deaths = Egypt_2019[cause_of_deaths].sum().to_frame().reset_index()
latest_deaths.rename(columns={'index': 'Diseases', 0: 'Total_no_of_Deaths'}, inplace=True)
latest_deaths.sort_values(by='Total_no_of_Deaths', ascending=False, inplace=True)
latest_deaths.head(10)  # Display the top 10 causes of death in 2019
Out[483]:
Diseases Total_no_of_Deaths
11 Cardiovascular Diseases 263873
28 Digestive Diseases 69216
27 Cirrhosis and Other Chronic Liver Diseases 62635
19 Neoplasms 57934
25 Road Injuries 29490
26 Chronic Respiratory Diseases 22560
22 Chronic Kidney Disease 21981
12 Lower Respiratory Infections 21371
21 Diabetes Mellitus 20478
17 Diarrheal Diseases 8474
In [484]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

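# Example of linear regression
# Note: the features are five of the cause columns that are summed into Total_no_of_Deaths,
# so a high R2 is expected here and does not by itself indicate forecasting ability.
X = Egypt_df[['Cardiovascular Diseases', 'Neoplasms', 'Chronic Respiratory Diseases', 'Alzheimer\'s Disease and Other Dementias', 'Digestive Diseases']]
y = Egypt_df['Total_no_of_Deaths']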

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
Mean Squared Error: 13111629.863343896
R2 Score: 0.995502477370117
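
Because these are yearly observations, a random split lets the model see years on both sides of every test point. A hedged alternative is a chronological hold-out: fit on the earlier years and score on the most recent ones. The sketch below applies this to the yearly totals from the `EG_Deaths_by_year` output above, using a plain linear trend via `np.polyfit` rather than the model in the cell above. The 2014 cutoff is an arbitrary choice for illustration.

```python
import numpy as np

years = np.arange(1990, 2020)
# Total_no_of_Deaths per year for Egypt, copied from the EG_Deaths_by_year output
totals = np.array([
    468409, 457493, 447129, 448294, 448394, 438552, 432618, 435796, 436987, 438946,
    429426, 448267, 463366, 481773, 483542, 481658, 485863, 488266, 499896, 514724,
    523236, 529436, 544796, 540252, 551643, 581847, 587922, 588017, 596617, 605194,
], dtype=float)

# Chronological split: train on 1990-2013, hold out 2014-2019
train = years < 2014
test = ~train

# Fit a straight-line trend on the training years (years centred for stability)
offset = years[train].mean()
slope, intercept = np.polyfit(years[train] - offset, totals[train], deg=1)
pred = slope * (years[test] - offset) + intercept

rmse = np.sqrt(np.mean((totals[test] - pred) ** 2))
print(f"Held-out years: {years[test].tolist()}")
print(f"RMSE on held-out years: {rmse:,.0f}")
```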